Unsupervised Feature Selection through Gram-Schmidt Orthogonalization — A Word Co-occurrence Perspective

Deqing Wang a,∗, Hui Zhang a, Rui Liu a, Xianglong Liu a, Jing Wang b

a SKLSDE, School of Computer Science, Beihang University, Beijing, China 100191
b School of Economics and Management, Beihang University, Beijing, China 100191
Abstract

Feature selection is a key step in many machine learning applications, such as categorization and clustering. For text data in particular, the original document-term matrix is high-dimensional and sparse, which degrades the performance of feature selection algorithms. Meanwhile, labeling training instances is time-consuming and expensive, so unsupervised feature selection algorithms have attracted increasing attention. In this paper, we propose an unsupervised feature selection algorithm based on Random Projection and Gram-Schmidt Orthogonalization (RP-GSO) applied to the word co-occurrence matrix. The RP-GSO algorithm has three advantages: (1) it takes as input the dense word co-occurrence matrix, avoiding the sparseness of the original document-term matrix; (2) it selects "basis features" by the Gram-Schmidt process, guaranteeing the orthogonality of the feature space; and (3) it adopts random projection to speed up the Gram-Schmidt process. Extensive experimental results show that our proposed RP-GSO approach achieves better performance than both supervised and unsupervised feature selection methods in text classification and clustering tasks.

Keywords: Feature Selection, Random Projection, Gram-Schmidt Orthogonalization, Basis Features, Word Co-occurrence Matrix
∗ Corresponding author. Tel: 8610-82338084; Fax: 8610-82339924.
Email addresses: [email protected] (Deqing Wang), [email protected] (Hui Zhang), [email protected] (Rui Liu), [email protected] (Xianglong Liu), [email protected] (Jing Wang)
1. Introduction

High dimensionality and sparseness are two characteristics of the text vector space. For instance, the dimensionality of a moderate-sized text corpus can reach tens or hundreds of thousands, but each document only contains several hundred words. The high dimensionality of the feature space causes the "curse of dimensionality", increases the training time, and hurts the predictive accuracy of models in many machine learning applications, such as classification [1, 2, 3, 4] and clustering [5, 6, 7, 8, 9]. Therefore, feature selection techniques have been proposed to reduce dimensionality while preserving the performance of the models.

Feature selection (FS) methods can be divided into supervised and unsupervised ones based on whether label information is available [10, 11, 12, 13, 14]. For text classification tasks, many typical supervised feature selection approaches are widely used, e.g., chi-square statistics (χ2), information gain [3, 2], the t-test [15], Gram-Schmidt-orthogonalization-based forward selection [16, 17], and supervised MCFS [12]. However, it is time-consuming and expensive to label training samples, especially for large-scale text data, where it is impossible to acquire label information for every document. To address this problem, more and more unsupervised feature selection methods have been developed [7, 8, 5, 12, 13, 14], such as the Laplacian Score [5], unsupervised MCFS [12], and Joint Laplacian Feature Weights [14].

Recently, some researchers have designed provably efficient algorithms to statistically recover topic models. For example, Donoho and Stodden [18] proposed the "separability" concept in non-negative matrix factorization (NMF), which assumes that every topic contains at least one anchor word that has non-zero probability only in that topic [19]. The assumption implies that there exist basis vectors residing in and spanning the original space, so effective algorithms can be employed to detect these basis vectors. Along this line, several topic recovery algorithms have been proposed for text data [20, 6, 21]. Actually, these algorithms share one common point: some "anchor words" or "novel words" are first detected before recovering the model.
Figure 1: The comparative curves of two row vectors in the term-document matrix X and in the word co-occurrence matrix Q; the rows (the 2909-th and the 2931-th) are randomly selected from the Reuters corpus. (a) X(2909,:) and X(2931,:) are only partially overlapping in the original document-word space (axes: document identification vs. term weighting value). (b) Q(2909,:) and Q(2931,:) are almost wholly overlapping in the word co-occurrence space (axes: word identification vs. term weighting value).
If the corpus is represented as a term-document matrix, then the vectors corresponding to these anchor words are linearly independent, and the rest can be regarded as linear combinations of them. The experiments in Refs. [20, 6, 21] have verified that this assumption is reasonable in real-world corpora. Inspired by "anchor words", we observe that we can employ some algorithm to detect basis vectors from a set of vectors as our "basis features", which span the original space.

The above approaches detect "anchor words" with different algorithms from the term-document matrix X or the word co-occurrence matrix Q; note that X and Q are used here as general notation without a specific normalization, because different algorithms need different normalizations. We observe that the transformation from X to Q is also favorable for feature selection algorithms, because the transformed space is denser and better suited to finding orthogonal basis vectors. As shown in Figure 1, we randomly selected two row vectors from the matrices X and Q of the Reuters corpus, with row identifications 2909 and 2931. Figure 1(a) shows the scatter diagram of the two vectors in the term-document space, i.e., X(2909,:) and X(2931,:), in which the green points stand for X(2931,:) and the red points for X(2909,:). It is obvious that the two row vectors are only partially overlapping in the original term-document space. Figure 1(b), however, demonstrates the change after we transform X into Q: the two vectors are almost wholly overlapping in the word co-occurrence space. Therefore, the matrix Q not only avoids the sparseness of the matrix X, but also overcomes the overly narrow projection distribution during the orthogonalization process in our algorithm.

Inspired by the "separability" concept in NMF and by orthogonalization in matrix theory, we propose an unsupervised feature selection algorithm based on Random Projection and Gram-Schmidt Orthogonalization (RP-GSO), which detects basis features from the normalized empirical word co-occurrence matrix using Gram-Schmidt orthogonalization [16, 22], and employs random projection to speed up the orthogonalization process. In this paper, we make the following contributions:

• The proposed RP-GSO algorithm takes as input the dense word co-occurrence matrix, rather than the sparse document-term matrix. This modification of the input not only avoids the sparseness of the matrix, but also overcomes the overly narrow projection distribution during the Gram-Schmidt orthogonalization process.
• Our algorithm applies the linear dependence and orthogonalization of vector spaces to detect basis vectors, which can be regarded as "basis features".

• Extensive experiments on real-world text classification and clustering tasks demonstrate the superiority of our proposed algorithm, compared against 10 state-of-the-art unsupervised and supervised feature selection algorithms.

The rest of this paper is organized as follows. In Section 2, we briefly describe the related work on FS metrics. In Section 3, we propose our new feature selection algorithm based on random projection and Gram-Schmidt orthogonalization. In Section 4, we describe the data sets, methodology, and performance measures used in our experiments. Section 5 presents the experimental results and shows the effectiveness of our new approach. Finally, we conclude this paper with future work.

2. Related Work

To deal with large-scale text corpora, many feature selection approaches have been proposed. Feature selection methods can be divided into supervised and unsupervised methods based on whether label information is available [23, 12]. For supervised feature selection methods, readers can refer to the papers [3, 2, 23]. Here we briefly review recent feature selection approaches, especially from the topic model perspective.
Recently, some spectral-clustering-based FS approaches have been proposed, e.g., the Laplacian Score [5], MCFS [12], and Joint Laplacian feature weights learning [14]. The Laplacian Score is based on the observation that two data points probably belong to the same category if they are close to each other, and it employs a nearest-neighbor graph to select features that respect the local geometric structure of the data [5]. Yan and Yang [14] extended the Laplacian Score, constrained the indicator vector to be nonnegative and l2-normalized, and proposed the JLFWL algorithm. In many real-life corpora, the data have a multi-cluster structure, so Cai et al. [12] proposed the unsupervised MCFS algorithm to preserve this property well.
Besides manifold learning algorithms, topic models have also been used to extend the feature space of short texts. The topic model was first proposed by Blei et al. [24]; it assumes that each document is a mixture of several or tens of topics, and that the words in the document are drawn from a topic-word matrix. So a topic model not only reduces dimensionality, but also enriches short texts by adding high-level features. Along this line, many topic-model-based feature extension algorithms have been proposed [25, 26, 27]. They first apply Latent Dirichlet Allocation (LDA) [24] to learn a topic-word distribution from a larger external unlabeled text corpus (auxiliary data), then infer the topic distribution of each document based on the model, and last use the latent topics as features to enrich the short text. For example, Phan et al. [27] used Wikipedia data to enrich short texts by introducing new features (latent topics). But the disadvantage is also obvious: the introduction of new features is empirical and lacks theoretical support, and the subsequent approaches [25, 26] share this shortcoming. It is also worth noting that the exploitation of external data is an inherently noisy operation, and there is a risk that a certain portion of the auxiliary data may in fact be semantically unrelated to the original short texts.

Meanwhile, researchers in theoretical computer science have focused on feature selection algorithms from a matrix theory perspective. For example, as early as 1989, Chen et al. [16] first applied Gram-Schmidt orthogonal projection in the feature selection area and proposed a supervised forward selection method. Recently, many theoretical researchers have designed provably efficient algorithms based on non-negative matrix factorization (NMF) to statistically recover model parameters in polynomial time [19, 20, 6, 21]. Donoho and Stodden first proposed the "separability" concept in non-negative matrix factorization, which assumes that every topic contains at least one word that has non-zero probability only in that topic [18]. The assumption implies that there exist basis vectors residing in and spanning the original space; the row vectors of the term-document matrix X are therefore linearly dependent, and we can detect the basis vectors among them. Along this line, Arora et al. proposed a topic discovery algorithm with provable guarantees based on anchor words and the word co-occurrence probability matrix [20]. It is worth noting that their algorithm is suitable for topic recovery but not directly compatible with feature selection because of its l1 normalization. Our proposed method is inspired by their anchor words, but we regularize the vector space onto the unit sphere by l2 normalization.
Based on anchor words, Kumar et al. directly find the extreme rays of the conical hull of a finite set of vectors that form the word-document matrix of the corpus [6]. The shortcoming of this method is that the detected extreme rays are not orthogonal, which hurts performance when they are used in classification and clustering tasks. Ding et al. proposed an algorithm based on cross-document word-frequency patterns and presented two efficient methods to detect "novel words", i.e., data-dependent and random projections [21], which achieved similar performance in their experiments. Their random projection method employs random directions drawn uniformly i.i.d. over the unit sphere in R^M, so it is not suitable for large-scale data because the random directions depend on the number of instances. Moreover, the detected novel words are probably not orthogonal. Therefore, we should pick orthogonal novel words carefully from a matrix perspective, and that is the reason why we adopt the Gram-Schmidt process, which applies projections to make the selected novel words orthogonal.

From the word-word co-occurrence matrix perspective, many minimum-redundancy feature selection methods have been proposed [28, 29, 30]. These methods often use greedy search to select the feature that has minimum redundancy with the existing feature subset, and they measure redundancy by mutual information. Our proposed method, however, adopts projection to avoid redundancy between features, which guarantees not only minimum redundancy but also orthogonality between features.
3. RP-GSO: Random Projection & Gram-Schmidt Orthogonalization

Given a corpus D composed of M documents over a vocabulary V of size W, let X.i ∈ R^W be the column vector of document d_i (1 ≤ i ≤ M) in D under the classic "bag of words" representation, such that the j-th (1 ≤ j ≤ W) entry X_ji is the term frequency of word j in d_i, and let n_d denote the length of d_i. Similar to Arora et al. [31], we use X̄.i = X.i / √(n_d(n_d − 1)) as the normalized vector of X.i, and the corresponding diagonal matrix is X̃.i = Diag(X.i)/(n_d(n_d − 1)). We then collect all the column vectors X̄.i to form a large sparse matrix X̄, and sum all the X̃.i to get the diagonal matrix X̃; the word co-occurrence matrix Q is then given by Eq. (1):

Q = \frac{1}{M} \left( \bar{X}\bar{X}^{T} - \tilde{X} \right),    (1)
where Q is a W × W square matrix. In the infinite-data case, where we collect infinitely many documents, Q would be the second-moment matrix of X. Meanwhile, from a probabilistic perspective, each entry Q_ij stands for the empirical co-occurrence probability between word i and word j in the corpus, i.e., Q_ij = p(w_i, w_j). After removing the scalar factor 1/M, the matrix −Q is similar to the graph Laplacian matrix L in spectral learning, if we treat each word the way each document is treated there and regard the co-occurrence frequency as a similarity function between words.
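To make the construction of Q concrete, the following Matlab sketch builds X̄, X̃, and Q from a raw W × M term-frequency matrix X according to Eq. (1). It is a minimal illustration written by us rather than the authors' released code; the variable names (X, Xbar, Xtil, Q) are our own, and it assumes every document has length at least two.

% A minimal sketch of Eq. (1), assuming X is a W-by-M term-frequency matrix
% with column i holding the word counts of document i (length >= 2).
[W, M] = size(X);
Xbar = zeros(W, M);
Xtil = zeros(W, 1);                    % accumulate only the diagonal of X~
for i = 1:M
    nd = sum(X(:, i));                 % length of document i
    Xbar(:, i) = X(:, i) / sqrt(nd * (nd - 1));
    Xtil = Xtil + X(:, i) / (nd * (nd - 1));
end
Q = (Xbar * Xbar' - diag(Xtil)) / M;   % W-by-W word co-occurrence matrix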
The key concept of NMF-based topic model recovery is the separability assumption [18], which assumes that some words perfectly indicate their corresponding topics.

Separability assumption. The word-topic matrix β of size W × K is p-separable for p > 0 if, for each topic k, there is some word i such that β_{i,k} ≥ p and β_{i,k'} = 0 for k' ≠ k.
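As a small illustrative example (ours, not taken from the cited papers), the following word-topic matrix over W = 4 words and K = 2 topics is p-separable with p = 0.4: word 1 is an anchor word for topic 1 and word 2 is an anchor word for topic 2, while words 3 and 4 occur in both topics.

\[
\beta =
\begin{pmatrix}
0.4 & 0 \\
0 & 0.5 \\
0.3 & 0.2 \\
0.3 & 0.3
\end{pmatrix},
\qquad
\beta_{1,1} = 0.4 \ge p,\ \beta_{1,2} = 0,
\qquad
\beta_{2,2} = 0.5 \ge p,\ \beta_{2,1} = 0 .
\]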
Suppose θ of size K × M is the weight matrix whose column vectors are the mixing weights over the K topics. From the NMF perspective, we can write X = βθ, and hence the word co-occurrence matrix satisfies Q = XX^T = βθθ^T β^T (up to the normalization in Eq. (1)). If the separability assumption holds, there exist "anchor words" [20, 21] that are unique to each topic, so all rows of X (or Q) reside in the space generated by a subset of K rows of X (or Q).

Here is a brief derivation. Because the word-topic matrix β of size W × K satisfies the separability assumption, there exists a W × W permutation matrix P such that the first K rows of the product Pβ form a K × K diagonal matrix. Hence the first K rows of PX are scaled copies of the rows of θ, i.e., K rows of X coincide, up to scaling, with the rows of θ. Consequently, every row of X is a linear combination of these K rows, and we can employ some algorithm to detect these row vectors as our "basis features". This is the theoretical foundation of our new feature selection algorithm. In fact, the same derivation also holds for the word co-occurrence matrix Q.
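For completeness, the row-space argument can be written out compactly as follows (our restatement of the derivation above, with D denoting the diagonal block produced by the permutation and a_k the anchor word of topic k):

\[
P\beta = \begin{pmatrix} D \\ B \end{pmatrix},\quad D = \mathrm{diag}(p_1,\dots,p_K),\qquad
PX = P\beta\,\theta = \begin{pmatrix} D\theta \\ B\theta \end{pmatrix},
\]
\[
\text{so } X_{a_k,\cdot} = p_k\,\theta_{k,\cdot}, \text{ and for any word } w:\quad
X_{w,\cdot} \;=\; \sum_{k=1}^{K} \beta_{w,k}\,\theta_{k,\cdot} \;=\; \sum_{k=1}^{K} \frac{\beta_{w,k}}{p_k}\, X_{a_k,\cdot}.
\]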
As noted in the introduction, the transformation from X to Q is favorable for feature selection algorithms, because the transformed space is denser and better suited to finding orthogonal basis vectors. So we take the dense word co-occurrence matrix Q, rather than the sparse word-document matrix X, as the input of our RP-GSO algorithm. In the infinite-data case, i.e., when M tends to infinity, the convex hull of the rows of X (or Q) is a simplex whose vertices are the anchor words. However, we only have a finite number of documents in a real-world corpus, so our empirical X and Q are only approximations of their expectations.

The RP-GSO algorithm is implemented in Matlab, and its pseudocode is shown in Fig. 2. In Step 1, we first parse each document d_i and count the number of times each word appears in d_i, i.e., X.i, then normalize X.i into X̄.i, and obtain the large matrix X̄, which consists of W rows and M columns, each column standing for a document. In Step 2, we compute the word-word co-occurrence matrix Q through matrix operations. Step 3 is very important for our feature selection algorithm: it applies row l2 normalization to obtain unit vectors, so that all the vectors lie on the unit sphere. This differs from the l1 normalization used in Arora's algorithm [20], because our algorithm focuses on feature selection, whereas Arora's algorithm focuses on recovering the word-topic matrix, for which l2 normalization makes no sense. In Steps 4 and 5, we generate a W × T random projection matrix R to reduce the dimension of Q̄ (the row-l2-normalized Q) from W × W to W × T, where T ≪ W. Last, in Step 6, we apply Gram-Schmidt orthogonalization [22] to detect the basis vectors of Qrp, so that each row vector of Qrp may be written as a linear combination of these basis vectors.
The RP-GSO Algorithm
Input:  D: the text corpus, containing M docs and W words.
        K: the number of basis vectors (basis features).
        T: the column dimension of the random matrix R, with T ≪ W.
Output: S: the K feature indices (indices of basis features).
Procedure: RP-GSO(D, K, T)
1. [X̄, X̃] = generateWord_Doc_matrix(D);          % X̄ is W × M
2. Q = generateWord_Word_matrix(X̄, X̃);           % Q is W × W
3. Q̄ = normr(Q);                                  % row l2 normalization
4. R = generate_Random_Projection_Matrix(T);      % according to Eq. 2
5. Qrp = Q̄ * R;                                   % Qrp is W × T and T ≪ W
6. S = gram_schmidt_orth(Qrp, K):
   6.1 S = {d_i} s.t. d_i is the farthest point from the origin;
   6.2 for i = 1 to K − 1
          let d_j be the point in the set V of row vectors of Qrp
          that has the largest distance to span(S);
          S ← S ∪ {d_j};
7. return S;

Figure 2: The Matlab pseudocode of the RP-GSO algorithm.
The RP-GSO algorithm can be parallelized through its matrix operations, so it can handle large-scale data, especially the M ≫ W case.

Here we give the details of the implementation of the RP-GSO algorithm. The first is the random projection preprocessing before the Gram-Schmidt orthogonalization, which employs the Johnson-Lindenstrauss transform [32] to reduce the text dimensionality dramatically and makes our algorithm more efficient for large-scale data. In the text mining domain, it would be time-consuming to apply our algorithm directly to the original high-dimensional text vector space. Therefore, we first apply random projection to reduce the dimensionality of the matrix Q̄, in order to save computing time and improve efficiency. Random projection relies on the Johnson-Lindenstrauss lemma, which asserts that any set of n points in d-dimensional Euclidean space can be embedded into a k-dimensional Euclidean space, where k is logarithmic in n and independent of d, such that all pairwise distances are maintained within an arbitrarily small factor [32]. Achlioptas [32] proposed a construction of such an embedding with the property that all elements of the projection matrix belong to {−1, 0, +1}, which is well suited for database environments; the proof of the construction can be found in Achlioptas's paper. For the W × W matrix Q̄, we directly apply the construction proposed by Achlioptas [32] to generate a new W × T matrix R, where T ≪ W and each entry R_ij is an independent random variable drawn from the probability distribution in Eq. (2):

R_{ij} = \sqrt{3} \times
\begin{cases}
+1 & \text{with probability } 1/6, \\
0 & \text{with probability } 2/3, \\
-1 & \text{with probability } 1/6,
\end{cases}    (2)
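A minimal Matlab sketch of this step is shown below. It samples R from the distribution in Eq. (2) and projects the row-normalized matrix Q̄ (here called Qbar, which is assumed to have been computed already); it is our own illustration rather than the authors' code, and T = 200 is the value reported in the experimental setup.

% Sample a W-by-T Achlioptas random projection matrix R (Eq. 2):
% entries are sqrt(3)*{+1, 0, -1} with probabilities 1/6, 2/3, 1/6.
W = size(Qbar, 1);
T = 200;                               % projection dimension used in the paper
U = rand(W, T);
R = sqrt(3) * ((U < 1/6) - (U >= 5/6));
Qrp = Qbar * R;                        % W-by-T projected matrix, T << W

The rows of Qrp are then handed to the Gram-Schmidt selection step described next.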
Then Qrp = Q̄R is a W × T matrix, and the dimensionality is reduced from W × W to W × T.

The second detail is the core procedure for detecting basis vectors from a set of vectors, since we need an algorithm to select basis vectors from the set of all row vectors of Qrp. As noted above, the convex hull of the rows of Qrp would be a simplex whose vertices correspond to the "basis features" if we had infinitely many documents. Since we only have a finite number of documents, the rows of Qrp are only an approximation of their expectation. Referring to the definition and theorem in reference [20], we can obtain the following result: there is a combinatorial algorithm that runs in time O(W^2 + WK/ε^2) and outputs a subset of {d_1, . . . , d_W} of size K that O(ε/γ)-covers the vertices, provided that 20Kε/γ^2 < γ, where γ > 0 and ε > 0. The detailed proof can be found in reference [20].
In mathematics, there are many decomposition methods for detecting basis vectors, and the Gram-Schmidt process is a method for orthonormalizing a set of vectors in an inner product space, most commonly the Euclidean space R^n. For vectors u and v, we define the projection operator by

\mathrm{proj}_{u}(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u,

where ⟨u, v⟩ denotes the inner product of u and v. This operator projects the vector v orthogonally onto the line spanned by the vector u. Therefore, for K vectors {v_1, v_2, . . . , v_K}, we can obtain the orthogonal sequence of vectors {u_1, u_2, . . . , u_K} by Eq. (3):

u_k = v_k - \sum_{j=1}^{k-1} \mathrm{proj}_{u_j}(v_k).    (3)
Based on the above orthogonalization process, we first detect the farthest point from the origin in Qrp (denoted v_1) as the first basis vector and add its normalized vector u_1, obtained according to Eq. (3), into S^(1). Then, for each of the remaining vectors, we remove the components in the directions u_i (1 ≤ i ≤ K − 1) already chosen, and pick the vector with the maximum distance from the origin at each step t (1 ≤ t ≤ K − 1). After that, we obtain a subset S^(K) consisting of K rows selected from the matrix Qrp. It should be noted that our method differs from the process used in the Fast Anchor Detection algorithm [20], which moves the origin from 0 to the farthest point in order to find anchor words with lower frequencies. In this paper, we directly employ the Gram-Schmidt process [22] to detect orthogonal basis vectors as our "basis features".
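The selection loop of Step 6 can be sketched in Matlab as follows. This is our own minimal illustration of the Gram-Schmidt-based farthest-point selection described above, not the released gram_schmidt_orth routine; it assumes Qrp is the W × T projected matrix, K does not exceed the rank of Qrp, and it returns the indices S of the K selected rows.

% Gram-Schmidt based selection of K basis rows from Qrp (W-by-T).
function S = gram_schmidt_select(Qrp, K)
    V = Qrp;                            % residuals of the row vectors
    S = zeros(1, K);
    for t = 1:K
        dist = sum(V.^2, 2);            % squared distance of each residual from the origin
        [~, idx] = max(dist);           % farthest remaining point = distance to span(S)
        S(t) = idx;
        u = V(idx, :) / norm(V(idx, :));
        V = V - (V * u') * u;           % remove the component along u from every row
    end
end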
4. Setups

In this section, we introduce the text corpora, methodology, and performance measures of the text classification and clustering tasks.

4.1. Data Sets

Reuters-21578 (available at http://ronaldo.cs.tcd.ie/esslli07/sw/step01.tgz): The Reuters corpus is a widely used benchmark collection. According to the ModApte split, we get a collection of 52 categories after removing unlabeled documents and documents with more than one class label, which contains 6,532 training documents and 2,568 test documents. We then filter out words whose document frequency is less than 15, and merge the training and test sets into one set with M = 9100, W = 2950, C = 52. Note that Reuters-21578 is a very skewed data set, in which the majority category (earn) accounts for 43% of all instances, whereas the top-ten minority categories only contain several instances each. The number of selected features in our experiments is set to 20, 50, 100, 200, 500, 1000, 2000, and all (2950), respectively; the last one serves as the comparison baseline.

20Newsgroup (available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#news20): The Newsgroup corpus consists of about 20K documents, which are uniformly distributed over 20 categories. We filter out the words with document frequency less than 100, which yields a corpus with M = 19914, W = 3089, C = 20. For 20Newsgroup, the number of features is also set to 20, 50, 100, 200, 500, 1000, 2000, and all (3089), respectively.

Table 1 shows some characteristics of the two text corpora. Note that the density of a matrix S is nnz(S)/prod(size(S)), where nnz(S) is the number of nonzero elements in S and prod(size(S)) is the product of the number of rows and the number of columns of S. As shown in the last two columns of Table 1, the word co-occurrence matrix possesses a much higher density than the original document-word matrix.

Table 1: The characteristics of the real-world text collections.

Dataset        #Instances  #Classes  #Vocab  Density (Doc-word)  Density (Word Co-occurrence)
Reuters        9100        52        2950    1.27%               41.02%
20 Newsgroup   19914       20        3089    1.60%               91.46%

4.2. Methodology

We compare our RP-GSO method with 6 unsupervised methods: the Fast Conical Hull algorithm, with its two variations based on two exterior point selection criteria (distance and greedy), i.e., FCH(dist) and FCH(greedy) [6]; the Fast Anchor Words algorithm (FAW) [20]; GibbsLDA [33]; the Laplacian Score (Lscore) [5]; and unsupervised Multi-Cluster Feature Selection (MCFS) [12]. Besides, we also compare with 4 supervised feature selection methods: χ2, IG, GS Forward Selection (ForwardGS) [17, 23], and supervised MCFS (superMCFS) [12].

For FCH, we re-implemented the Matlab code with the help of the author. For FAW, we rewrote the Matlab code according to the authors' Python version. For GibbsLDA, we use its Java version and apply the hidden topics as features, that is, we use the learned document-topic matrix θ as the input of classification and clustering. For the Laplacian Score and the two MCFS variations, we use the open-source Matlab code downloaded from the authors' web site (http://www.cad.zju.edu.cn/home/dengcai/Data/data.html). For ForwardGS, we use the Matlab code in Ref. [23]. For our proposed RP-GSO, the dimension of the random projection T is simply set to 200 on both corpora; one can set it more precisely according to the parameter ε, as is done in Arora's paper [20].
4.3. Performance Measures

We evaluate our method on two different text mining tasks, i.e., classification and clustering. For the classification task, we use LibSVM [34, 35] with a linear kernel and 10-fold cross-validation to obtain the classification error rate [35]. For the clustering task, we use the kMeans algorithm implemented in Matlab with the parameters 'Distance','sqEuclidean', 'Start','uniform', 'EmptyAction','singleton', and the cluster number is set to 52 (for the Reuters corpus) and 20 (for the 20Newsgroup corpus), respectively. We evaluate the clustering performance with the average Normalized Mutual Information (NMI) [6] over 5 clustering runs with random initialization. Before using LibSVM or kMeans, all the input instances are l2-normalized.
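As a concrete illustration of this evaluation protocol, the following Matlab sketch shows how the two calls could look, assuming a feature-reduced instance matrix Xsel (instances in rows), a label vector y, and a cluster number C; it uses the LibSVM Matlab interface and Matlab's kmeans with the parameters quoted above, and it is our own example rather than the exact experiment script.

% Classification: LibSVM with a linear kernel ('-t 0') and 10-fold CV ('-v 10').
% With '-v', the LibSVM Matlab interface returns the cross-validation accuracy.
acc = svmtrain(y, sparse(Xsel), '-t 0 -v 10');
err = 100 - acc;                               % classification error rate (%)

% Clustering: Matlab kmeans with the parameters used in the paper,
% where C is 52 for Reuters and 20 for 20Newsgroup.
idx = kmeans(Xsel, C, 'Distance', 'sqEuclidean', ...
             'Start', 'uniform', 'EmptyAction', 'singleton');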
5. Experimental Results

5.1. Classification Results

In this section, we demonstrate the superiority of our RP-GSO method over unsupervised feature selection methods in terms of classification error rate. Meanwhile, we also compare RP-GSO with supervised feature selection approaches. The experimental results show that our algorithm achieves better performance on both skewed and balanced corpora.
Figure 3: The classification error rate on Reuters compared with 6 unsupervised methods. (Axes: feature number vs. classification error rate; curves: FCH(dist), FCH(greedy), GibbsLDA, FAW, RP-GSO, Lscore, MCFS.)
Classification error rate on skewed Reuters. Figure 3 depicts the classification error rate of the LibSVM classifier on Reuters when the unsupervised feature selection methods are used. As shown in Figure 3, the classification error rate decreases as the number of features increases for all the unsupervised feature selection methods except GibbsLDA. Our proposed RP-GSO achieves significantly better performance than the other methods under a 95% t-test; in particular, when the number of features is less than 500, the reduction in classification error rate ranges from 10% to 20% compared with the other unsupervised methods. Comparing FCH and FAW, the former is worse than the latter when the number of features is less than 100, whereas FCH performs better when the number of features exceeds 200. The gap between our RP-GSO and FCH(dist) becomes narrow when the number of features exceeds 500. For the spectral-clustering methods, we can observe that MCFS is superior to Lscore when the number of features is small, and then becomes inferior as the number of features increases. Generally, FCH(dist) achieves the second-best classification error rate. When the number of features is larger than 200, the FAW method performs worse than the others, because its l1 normalization of the vectors hurts its detection of basis features for the classification task. There is an interesting and natural phenomenon with the GibbsLDA method: our experimental results show that GibbsLDA performs best when the number of features is less than 100, even better than our RP-GSO, but its performance drops drastically when the topic number exceeds 100. The reason is that we use hidden topics as document features; this works when the topic number is set small, and breaks down after setting too large a topic number. In fact, how to set the number of hidden topics in a topic model is still an open question.
Figure 4: The classification error rate on Reuters compared against 4 supervised methods. (Axes: feature number vs. classification error rate; curves: CHI, IG, RP-GSO, ForwardGS, superMCFS.)
Figure 4 shows the experimental comparison between RP-GSO and the 4 supervised methods. The trend is similar to that in Figure 3. The results show that our RP-GSO method still achieves the best performance, but the gap is narrow, and the performance of RP-GSO is close to χ2 and IG when the number of features is larger than 1000. Checking the word indices selected by the three methods, we find that about 90% of the features are the same, which is the reason why they obtain similar classification error rates. It is also observed that superMCFS achieves performance close to the above three methods, and the ForwardGS approach is the worst. Figure 4 demonstrates that our RP-GSO algorithm is superior even when compared with supervised feature selection methods.
Figure 5: The classification error rate on 20 Newsgroup compared against unsupervised methods. (Axes: feature number vs. classification error rate; curves: FCH(dist), FCH(greedy), FAW, RP-GSO, Lscore, MCFS.)
Figure 6: The classification error rate on 20 Newsgroup compared with supervised methods. (Axes: feature number vs. classification error rate; curves: CHI, IG, RP-GSO, ForwardGS, superMCFS.)
Classification error rate on balanced Newsgroup. Figure 5 shows the classification error rate of the LibSVM classifier on 20 Newsgroup with the unsupervised feature selection methods. In this experiment, GibbsLDA is excluded because the corpus we used is its vector representation, which cannot be transformed back into a sequence representation. We can observe that our proposed RP-GSO method consistently achieves the lowest classification error rate; the error rate is significantly lower than the others under a 95% t-test, and the reduction reaches up to about 20%. However, Figure 5 shows some differences in detail from Figure 3: (1) FCH always performs better than FAW; the reason may be that the balanced corpus makes the vector space distribution clearer. (2) For the FCH algorithm, FCH(greedy) is slightly better than FCH(dist), which is the opposite of the result on the skewed Reuters corpus. (3) Lscore and MCFS are inferior to RP-GSO and FCH, and Lscore is slightly better than MCFS, which differs from the results in Ref. [12] on image classification; the reason may be that the large cluster number assignment affects the manifold learning of the data points.

As shown in Figure 6, our RP-GSO method achieves a classification error rate similar to the χ2, IG, and superMCFS supervised feature selection methods, and it performs slightly better than IG and superMCFS; ForwardGS obtains the worst results. It should be noted that our method does not employ label information, whereas the other methods are only applicable to labeled corpora.

5.2. Clustering Results

In this section, we use the average normalized mutual information over 5 random runs to evaluate the performance of the feature selection methods on the clustering task. The clustering results demonstrate that our RP-GSO method achieves NMI scores competitive with the well-known supervised χ2, IG, and superMCFS on skewed Reuters, and better NMI on balanced 20 Newsgroup. Meanwhile, our RP-GSO still performs better than the unsupervised feature selection methods on the clustering task.
NMI on skewed Reuters. Figure 7 shows the comparative results of the unsupervised feature selection methods on skewed Reuters. We can observe that our RP-GSO method still achieves a significantly better NMI score under a 95% t-test, except for GibbsLDA with fewer than 100 features. Compared against the second-best FCH(dist), the improvement of our method ranges from 4% to 19.9% when the number of features is less than 500. Similar to the classification results on Reuters, GibbsLDA obtains the best performance when the number of features is small, and its performance decreases dramatically as the number of features increases. The NMI score of the FAW method is significantly lower than our RP-GSO, and even lower than the FCH methods; the reason may be that the clustering task requires more semantically related features, and the features selected by FAW do not satisfy this requirement. For the clustering task, Lscore performs better than MCFS, which in turn is only better than FAW. As shown in Figure 7, the difference between Lscore and RP-GSO is significant.
Figure 7: The normalized mutual information value of unsupervised methods on Reuters. (Axes: feature number vs. normalized mutual information; curves: FCH(dist), FCH(greedy), GibbsLDA, FAW, RP-GSO, Lscore, MCFS.)
Figure 8 depicts the comparison between our RP-GSO method and the supervised methods. The NMI scores show that the supervised feature selection methods perform slightly better than our RP-GSO, except for ForwardGS. When fewer features are used, χ2 and superMCFS achieve significantly better scores than our RP-GSO; the performance then becomes close as more features are used.

NMI on balanced Newsgroup. We continue to compare the NMI score on balanced 20 Newsgroup. As shown in Figure 9, our RP-GSO method performs superiorly in terms of NMI, and the improvement is significant when fewer than 200 features are selected.
Figure 8: The normalized mutual information value of supervised methods on Reuters. (Axes: feature number vs. normalized mutual information; curves: CHI, IG, RP-GSO, ForwardGS, superMCFS.)
Figure 9: The normalized mutual information value of unsupervised methods on balanced 20 Newsgroup. (Axes: feature number vs. normalized mutual information; curves: FCH(dist), FCH(greedy), FAW, RP-GSO, Lscore, MCFS.)
FAW obtains a poor NMI score, which is lower than 0.1 when the number of features is smaller than 1000, and a similar situation happens to the MCFS method with fewer than 200 features; this means that FAW and MCFS fail on this clustering task when few features are selected. Similar to Figure 7, the Lscore method consistently performs better than the MCFS method. Besides, FCH(dist) is slightly better than FCH(greedy), and the difference between them is small, so we prefer FCH(dist) because it is more efficient than FCH(greedy).
Figure 10: The normalized mutual information value of supervised methods on balanced 20 Newsgroup. (Axes: feature number vs. normalized mutual information; curves: CHI, IG, RP-GSO, ForwardGS, superMCFS.)
Figure 10 shows that our RP-GSO method achieves a competitive NMI score compared against the χ2, IG, and superMCFS supervised approaches. When the number of features is less than 200, our RP-GSO is worse than χ2 but still better than the IG method. Our method performs best when more than 200 features are chosen.

5.3. Influence of Input Space

In this section, we examine the influence of the input space on our RP-GSO algorithm. Because RP-GSO can be applied to any matrix, we directly use the original matrix X as its input, denoted RP-GSO-X. Compared to the word co-occurrence matrix Q, the original word-document matrix X is sparse: as shown in Table 1, the density of X on the two data sets is only 1.27% and 1.6%, respectively, which affects the Gram-Schmidt orthogonalization process.

Figures 11 and 12 show the performance comparison between X and Q when the same feature selection algorithm RP-GSO is used. For the classification task, as shown in Figure 11(a), we can observe that RP-GSO-X is consistently inferior to our RP-GSO in terms of classification error on the Reuters corpus. Figure 11(b) shows the same trend for the experiment on 20 Newsgroup. Meanwhile, it is noted that the classification error is the same when the number of features is larger than 1000.
Figure 11: The influences of input space on classification error: (a) Reuters, (b) 20 Newsgroup. (Axes: feature number vs. classification error; curves: RP-GSO, RP-GSO-X.)
Next, we continue to compare the performance of the different input spaces on the clustering task. As shown in Figure 12, our RP-GSO still performs better than RP-GSO-X, which uses the word-document matrix as input. For the Reuters corpus, as shown in Figure 12(a), we find that the word co-occurrence matrix Q is still beneficial to our algorithm, whereas the sparse word-document matrix X degrades the performance of feature selection. For the 20 Newsgroup corpus, the effect is significant when the number of features is smaller than 200, as shown in Figure 12(b).
Figure 12: The influences of input space on clustering NMI: (a) Reuters, (b) 20 Newsgroup. (Axes: feature number vs. NMI; curves: RP-GSO, RP-GSO-X.)
From the above experiments, we can conclude that the sparseness of the input matrix affects the performance of the RP-GSO method. Compared with the original sparse word-document matrix, the dense word co-occurrence matrix is better suited for the detection of basis vectors.

5.4. l2 vs. l1 Normalization

In this section, we compare the performance of l2 and l1 normalization. As shown in Figure 13, we observe that the classification error with l2 normalization is smaller than that with l1 normalization on both the Reuters and 20Newsgroup corpora. Especially on 20Newsgroup, the difference between l2 and l1 normalization is significant when the number of features is less than 1000.
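For clarity, the two variants compared here differ only in how the rows of Q are scaled before random projection and Gram-Schmidt selection. A minimal Matlab sketch of the two row normalizations (our own illustration, with Q as the W × W co-occurrence matrix) is:

% Row-wise l2 normalization (used by RP-GSO) vs. row-wise l1 normalization.
rowL2 = sqrt(sum(Q.^2, 2));
Qbar_l2 = bsxfun(@rdivide, Q, max(rowL2, eps));   % each row has unit l2 norm
rowL1 = sum(abs(Q), 2);
Qbar_l1 = bsxfun(@rdivide, Q, max(rowL1, eps));   % each row sums to one in absolute value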
Figure 13: The comparative curves of classification error between l2 and l1 normalization. (Axes: feature number vs. classification error; curves: RP-GSO on Reuters, RP-GSO+l1 on Reuters, RP-GSO on Newsgroup, RP-GSO+l1 on Newsgroup.)
For the clustering task on both corpora, the comparative NMI curves of l2 and l1 normalization are shown in Figure 14. A similar phenomenon can be observed, that is, the NMI of l2 is consistently superior to that of l1. Therefore, for the feature selection task, we observe that l2 normalization performs better than l1 normalization.

5.5. Parameter Settings

In the RP-GSO algorithm, R is a W × T matrix, and the parameter T is set manually. We conducted classification and clustering experiments on both corpora to analyze the performance of RP-GSO with different values of T.
Figure 14: The comparative curves of NMI between l2 and l1 normalization. (Axes: feature number vs. normalized mutual information; curves: RP-GSO on Reuters, RP-GSO+l1 on Reuters, RP-GSO on Newsgroup, RP-GSO+l1 on Newsgroup.)
Figure 15: Influences of T on classification accuracy and NMI: (a) classification accuracy with the number of features set to 1000, (b) NMI with the number of features set to 200. (Axes: T value vs. classification accuracy or NMI; curves: Reuters, 20 Newsgroup.)
As shown in Figure 15, subfigure (a) shows the classification accuracy trend with different T values, and subfigure (b) shows the NMI trend. We observe that the influence of the parameter T is insignificant on both Reuters and 20 Newsgroup in terms of classification accuracy and NMI score. Specifically, we set the number of features to 1000 for the classification task, and the accuracy fluctuation is very small when the T value ranges from 100 to 3000. The same phenomenon holds for the clustering task when the number of features is set to 200. When T is equal to 0, no random projection is used. We observe that the accuracy or NMI without random projection is close to that with random projection, but using random projection significantly reduces the computing time.

5.6. Discussions

When topic models are analyzed from the matrix perspective, we can obtain some insights. Actually, the recent works [19, 20, 6, 21] explore the same "separability" concept to recover the topic-word matrix from different viewpoints. Our paper is also inspired by Arora et al.'s work [20], but we focus on feature selection. From the above extensive experimental results, we observe that our RP-GSO feature selection method achieves competitive or better performance compared against unsupervised and supervised methods. Here we only verify the effectiveness of RP-GSO on classification and clustering tasks, but we believe that it will also be effective on other text mining tasks. The second point is that RP-GSO is suitable for large-scale data, because the generation of the matrix Q, the row l2 normalization, and the Gram-Schmidt process can all be parallelized. We will verify the efficiency of RP-GSO on large-scale text corpora in our future work.
6. Conclusions

Unsupervised feature selection is a good tool to help us deal with large-scale data, because it does not require label information. Inspired by anchor words, we adopt random projection and the Gram-Schmidt process to detect the basis vectors of the word co-occurrence matrix. Our extensive experiments on text classification and clustering tasks demonstrate the superiority of the proposed RP-GSO algorithm in terms of classification error rate and normalized mutual information, when LibSVM and the kMeans algorithm are used, respectively. In fact, our RP-GSO algorithm can be parallelized through its matrix computations, and we will verify the effectiveness and efficiency of RP-GSO on large-scale text corpora in future work.

7. Acknowledgments

We thank Abhishek Kumar for helping us implement the Matlab code of the Fast Conical Hull Algorithm [6], which gave us a deeper understanding of his algorithm. This research was supported by the China Postdoctoral Science Foundation funded project (Grant No. 2014M550591) and partly supported by the National Natural Science Foundation of China (Grant No. 71471009). Dr. Zhang was supported by SKLSDE-2015ZX-13. The Matlab code of our RP-GSO algorithm can be downloaded from http://www.nlsde.buaa.edu.cn/dqwang.

References

[1] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1–47.

[2] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.

[3] Y. Yang, J. O. Pedersen, A comparative study on feature selection in text categorization, In: Proceedings of ICML (1997) 412–420.

[4] D. Wang, J. Wu, K. Xu, M. Lin, Towards enhancing centroid classifier for text classification — a border-instance approach, Neurocomputing 101 (2013) 299–308.

[5] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, In: Advances in Neural Information Processing Systems 18 (2005) 1157–1182.

[6] A. Kumar, V. Sindhwani, P. Kambadur, Fast conical hull algorithms for near-separable non-negative matrix factorization, In: Proceedings of the 30th International Conference on Machine Learning (2013) 231–239.

[7] L. Wolf, A. Shashua, Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach, Journal of Machine Learning Research 6 (2005) 1855–1887.

[8] Z. Zhao, H. Liu, Spectral feature selection for supervised and unsupervised learning, In: Proceedings of the 24th Annual International Conference on Machine Learning (2007) 1151–1157.

[9] H. Xiong, J. Wu, J. Chen, K-means clustering versus validation measures: A data-distribution perspective, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39 (2) (2009) 318–331.

[10] C.-T. Su, Y.-H. Hsiao, Multiclass MTS for simultaneous feature selection and classification, IEEE Transactions on Knowledge and Data Engineering 21 (2009) 192–205.

[11] D. Tao, X. Li, X. Wu, S. J. Maybank, Geometric mean for subspace selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 260–274.

[12] D. Cai, C. Zhang, X. He, Unsupervised feature selection for multi-cluster data, In: Proceedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2010) 333–342.

[13] A. J. Ferreira, M. A. Figueiredo, An unsupervised approach to feature discretization and selection, Pattern Recognition 45 (9) (2012) 3048–3060.

[14] H. Yan, J. Yang, Joint Laplacian feature weights learning, Pattern Recognition 47 (3) (2014) 1425–1432.

[15] D. Wang, H. Zhang, R. Liu, W. Lv, D. Wang, T-test feature selection approach based on term frequency for text categorization, Pattern Recognition Letters 45 (2014) 1–10.

[16] S. Chen, S. A. Billings, W. Luo, Orthogonal least squares methods and their application to non-linear system identification, International Journal of Control 50 (5) (1989) 1873–1896.

[17] H. Stoppiglia, G. Dreyfus, R. Dubois, Y. Oussar, Ranking a random feature for variable and feature selection, Journal of Machine Learning Research 3 (2003) 1399–1414.

[18] D. Donoho, V. Stodden, When does non-negative matrix factorization give the correct decomposition into parts?, In: Proceedings of NIPS (2003).

[19] S. Arora, R. Ge, R. Kannan, A. Moitra, Computing a nonnegative matrix factorization – provably, In: Proceedings of the 44th Symposium on Theory of Computing (2012) 145–162.

[20] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, M. Zhu, A practical algorithm for topic modeling with provable guarantees, In: Proceedings of the 30th International Conference on Machine Learning (2013) 280–288.

[21] W. Ding, M. H. Rohban, P. Ishwar, V. Saligrama, Topic discovery through data dependent and random projections, In: Proceedings of the 30th International Conference on Machine Learning (2013) 1202–1210.

[22] G. H. Golub, C. F. Van Loan, Matrix Computations (3rd ed.), Johns Hopkins University Press, 1996.

[23] I. Guyon, Practical feature selection: from correlation to causality, Mining Massive Data Sets for Security (2008) 27–43.

[24] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.

[25] X. Hu, N. Sun, C. Zhang, T.-S. Chua, Exploiting internal and external semantics for the clustering of short texts using world knowledge, In: Proceedings of the 18th ACM Conference on Information and Knowledge Management (2009) 919–928.

[26] O. Jin, N. N. Liu, K. Zhao, Y. Yu, Q. Yang, Transferring topical knowledge from auxiliary long texts for short text clustering, In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (2011) 775–784.

[27] X.-H. Phan, C.-T. Nguyen, D.-T. Le, L.-M. Nguyen, A hidden topic-based framework toward building applications with short web documents, IEEE Transactions on Knowledge and Data Engineering 23 (7) (2011) 961–976.

[28] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, In: Proceedings of the 2nd IEEE Computer Society Bioinformatics Conference (CSB 2003) (2003) 523–529.

[29] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1226–1238.

[30] C.-C. Yen, L.-C. Chen, S.-D. Lin, Unsupervised feature selection: Minimize information redundancy of features, In: Proceedings of the 2010 International Conference on Technologies and Applications of Artificial Intelligence (2010) 247–254.

[31] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, M. Zhu, A practical algorithm for topic modeling with provable guarantees, CoRR abs/1212.4777. URL http://arxiv.org/abs/1212.4777

[32] D. Achlioptas, Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of Computer and System Sciences 66 (4) (2003) 671–687.

[33] T. L. Griffiths, M. Steyvers, Finding scientific topics, Proceedings of the National Academy of Sciences of the United States of America 101 (1) (2004) 5228–5235.

[34] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.

[35] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 1–27, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.