Available online at www.sciencedirect.com
Procedia Engineering 15 (2011) 1565–1569
www.elsevier.com/locate/procedia
Advances in Control Engineering and Information Science
Document classification algorithm based on MMP and LS-SVM

Ziqiang Wang*, Xia Sun
Henan University of Technology, School of Information Science and Engineering, Zhengzhou 450001, China
Abstract

Document classification is a hard but important research topic. To address this problem effectively, a novel document classification algorithm is proposed that combines maximum margin projection (MMP) and least squares support vector machines (LS-SVM). The high-dimensional document data are first projected into a lower-dimensional feature space via the MMP algorithm; the LS-SVM classifier then assigns test documents to classes based on the extracted semantic features. Experiments performed on two popular document datasets demonstrate the superior performance of the proposed document classification algorithm.
© 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer-review under responsibility of CEIS 2011.

Keywords: document classification; maximum margin projection (MMP); least squares support vector machines (LS-SVM)
1. Introduction

Over the last decade, with the increasing availability of powerful computing platforms and high-capacity storage hardware, the number of digital documents available for various purposes has grown tremendously. This dramatic increase demands efficient organization and retrieval methods, especially for large document databases. Document classification is one of the most crucial techniques for organizing documents in a supervised manner. In real document classification tasks, the raw document data are often very high-dimensional, ranging from several hundred to thousands of features, so direct operations on
* Corresponding author. Tel.: +86-371-6775-6538; fax: +86-371-6775-6538. E-mail address: [email protected].
1877-7058 © 2011 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. doi:10.1016/j.proeng.2011.08.291
them are computationally expensive and yield unsatisfactory results. A common way to cope with this problem is dimensionality reduction, which transforms a high-dimensional data set into a low-dimensional space while retaining most of the underlying structure in the data. Once the high-dimensional document data are projected into a lower-dimensional document semantic space, traditional classification algorithms can be applied in that space. Therefore, how to extract discriminating document features and how to classify a new document based on the extracted features are two critical issues for document classification algorithms.

In this paper, the objective is to investigate the performance of a document classification algorithm based on maximum margin projection (MMP) and least squares support vector machines (LS-SVM). The main idea of the algorithm is as follows: the high-dimensional document data are first projected into a lower-dimensional feature space using the MMP algorithm, and the LS-SVM classifier then assigns test documents to classes based on the extracted semantic features.

The rest of the paper is organized as follows. Section 2 presents the document classification algorithm based on MMP and LS-SVM. Experimental results are reported in Section 3. Finally, Section 4 offers conclusions.

2. Document Classification Algorithm Based on MMP and LS-SVM

2.1. The MMP-based document dimensionality reduction

Maximum margin projection (MMP) is a recently proposed manifold learning algorithm for dimensionality reduction [1]. It is based on locality-preserving neighbor relations and explicitly exploits class information for classification. It is a graph-based approach that learns a linear approximation to the intrinsic data manifold using both labeled and unlabeled data. Its goal is to discover both the geometrical and discriminant structures of the data manifold.
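Before the MMP formulation itself, the generic "project, then classify" idea can be illustrated with a plain SVD projection of a term-document matrix. The data below are hypothetical toy counts (the paper specifies no code); this corresponds to the kind of SVD subspace projection used later to avoid singularity:

```python
import numpy as np

# Hypothetical toy term-document matrix: 6 terms x 4 documents.
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
    [0, 0, 1, 3],
], dtype=float)

# Keep the two largest singular directions as a projection basis.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W_svd = U[:, :2]      # terms -> 2-D projection basis
Y = W_svd.T @ X       # each column is a reduced document vector

print(Y.shape)        # (2, 4)
```

Any classifier can then operate on the columns of `Y` instead of the raw high-dimensional term vectors.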
Given a set of documents {x_1, ..., x_n} ⊂ R^m and the corresponding class labels c_1, c_2, ..., c_n ∈ {1, 2, ..., p}, the document set can be represented as a term-document matrix X = [x_1, x_2, ..., x_n]. MMP aims to seek a document feature subspace that preserves the local geometrical and discriminant structures of the high-dimensional document manifold. Let S_b and S_w denote the weight matrices of the between-class graph G_b and the within-class graph G_w, respectively. MMP can be obtained by solving the following maximization problem:

    a_opt = arg max_a  a^T X (β L_b + (1 − β) S_w) X^T a    (1)

subject to

    a^T X D_w X^T a = 1    (2)

where L_b = D_b − S_b is the Laplacian matrix of G_b, and D_b is a diagonal matrix whose diagonal entries are the column sums of S_b, i.e., D_b,ii = Σ_{j=1}^{n} S_b,ij. Likewise, L_w = D_w − S_w is the Laplacian matrix of G_w, and D_w is a diagonal matrix whose diagonal entries are the column sums of S_w, i.e., D_w,ii = Σ_{j=1}^{n} S_w,ij.

The weight matrices S_b and S_w are defined as follows:

    S_w,ij = γ,  if x_i and x_j share the same label;
           = 1,  if x_i or x_j is unlabeled but x_i ∈ N_w(x_j) or x_j ∈ N_w(x_i);
           = 0,  otherwise.    (3)

    S_b,ij = 1,  if x_i ∈ N_b(x_j) or x_j ∈ N_b(x_i);
           = 0,  otherwise.    (4)

where N(x_i) = {x_i^1, ..., x_i^k} denotes the set of the k nearest neighbors of x_i, and l(x_i) represents the label of x_i. Specifically, N_b(x_i) = {x_i^j | l(x_i^j) ≠ l(x_i), j = 1, ..., k} contains the neighbors having different labels, and N_w(x_i) = N(x_i) − N_b(x_i) contains the rest of the neighbors.

Finally, by a simple algebraic transformation, the projection vectors of MMP are the eigenvectors associated with the largest eigenvalues of the following generalized eigenvalue problem:

    X (β L_b + (1 − β) S_w) X^T a = λ X D_w X^T a    (5)

where β ∈ [0, 1] is a suitable constant that controls the trade-off between the within-class graph and the between-class graph. To overcome the singularity problem, we can first project the document vectors into the SVD subspace by discarding the zero singular values, so that the matrix X D_w X^T becomes nonsingular. Let W_MMP = [a_1, a_2, ..., a_l] be the solutions of (5), ordered according to their eigenvalues λ_1 > λ_2 > ... > λ_l. The MMP embedding can then be defined as

    x_i → y_i = (W_SVD W_MMP)^T x_i    (6)

where y_i is the lower-dimensional representation of the document x_i, and W_SVD W_MMP is the transformation matrix.
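The graph construction (3)-(4) and the generalized eigenvalue problem (5) can be sketched as follows. This is a minimal illustration on hypothetical random data, not the authors' implementation: all points are labeled (so the "unlabeled" branch of (3) is unused), and a small ridge term stands in for the SVD projection step that makes X D_w X^T nonsingular:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# Hypothetical toy data: n documents (columns) in m dimensions, two classes.
m, n, k, beta, gamma = 10, 20, 3, 0.5, 1.0
X = rng.standard_normal((m, n))
labels = np.array([0] * 10 + [1] * 10)

# k nearest neighbors (Euclidean) for each document.
D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
np.fill_diagonal(D2, np.inf)
knn = np.argsort(D2, axis=1)[:, :k]

# Within-class (S_w) and between-class (S_b) weights over the kNN graph,
# following eqs. (3)-(4); every point here is labeled.
S_w = np.zeros((n, n))
S_b = np.zeros((n, n))
for i in range(n):
    for j in knn[i]:
        if labels[i] == labels[j]:
            S_w[i, j] = S_w[j, i] = gamma
        else:
            S_b[i, j] = S_b[j, i] = 1.0

D_w = np.diag(S_w.sum(axis=1))
D_b = np.diag(S_b.sum(axis=1))
L_b = D_b - S_b

# Generalized eigenproblem (5); keep eigenvectors of the largest eigenvalues.
A = X @ (beta * L_b + (1 - beta) * S_w) @ X.T
B = X @ D_w @ X.T + 1e-6 * np.eye(m)   # small ridge instead of the SVD step
vals, vecs = eigh(A, B)                # eigenvalues in ascending order
W_mmp = vecs[:, ::-1][:, :2]           # top-2 projection vectors
Y = W_mmp.T @ X                        # 2-D embedding, as in eq. (6)
print(Y.shape)                         # (2, 20)
```

Each column of `Y` is the low-dimensional representation of one document, ready to be fed to the classifier described next.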
2.2. The LS-SVM classifier for document classification

For document classification, classifier selection is another key issue after dimensionality reduction. Founded on statistical learning theory, the support vector machine (SVM) [2] classifier has drawn much attention due to its good performance in practical applications and its solid theoretical foundations. To reduce the computational demands of SVM, the least squares version of SVM (LS-SVM) [3] is adopted as the document classifier in this paper. LS-SVM avoids solving a quadratic programming problem and simplifies the training procedure; consequently, the LS-SVM classifier is computationally efficient.

Consider a linearly separable binary document classification problem

    {(x_i, y_i)}_{i=1}^{n},  y_i ∈ {+1, −1}    (7)

where x_i is the lower-dimensional document feature vector and y_i is its label. The LS-SVM classifier is constructed by minimizing the objective function

    min_{w,b,e} L(w, b, e) = (1/2) w^T w + (C/2) Σ_{i=1}^{n} e_i^2    (8)

subject to the constraints

    y_i [w^T φ(x_i) + b] = 1 − e_i,  i = 1, ..., n    (9)

where C > 0 is a regularization factor, b is a bias term, e_i is the difference between the desired output and the actual output, and φ(x_i) is a nonlinear mapping function. The Lagrangian for problem (8) under the constraints (9) is defined as

    R(w, e, b, α) = (1/2) w^T w + (C/2) Σ_{i=1}^{n} e_i^2 − Σ_{i=1}^{n} α_i { y_i [w^T φ(x_i) + b] − 1 + e_i }    (10)

where the α_i are Lagrange multipliers. The solution can be found through the following linear system:

    [ 0    −Y^T          ] [ b ]   [ 0 ]
    [ Y    ΦΦ^T + C^{-1}I ] [ α ] = [ 1 ]    (11)

where Φ = [φ(x_1)^T y_1, ..., φ(x_n)^T y_n], Y = [y_1, ..., y_n]^T, 1 = [1, ..., 1]^T, and α = [α_1, ..., α_n]^T.
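Training an LS-SVM thus reduces to one linear solve. The sketch below uses the common symmetric form of system (11) and a linear kernel on hypothetical toy data; any kernel can be substituted, and the function names are illustrative, not from the paper:

```python
import numpy as np

def lssvm_train(X, y, C=10.0, kernel=lambda u, v: u @ v):
    """Solve the LS-SVM linear system (symmetric form of eq. (11)).

    X: (n, d) feature rows; y: (n,) labels in {+1, -1}.
    Returns the bias b and the multipliers alpha.
    """
    n = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    Omega = (y[:, None] * y[None, :]) * K            # y_i y_j K(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / C
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)                    # one linear solve, no QP
    return sol[0], sol[1:]

def lssvm_predict(X_train, y, alpha, b, x, kernel=lambda u, v: u @ v):
    """Decision rule sgn(sum_i alpha_i y_i K(x, x_i) + b)."""
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y, X_train))
    return np.sign(s + b)

# Hypothetical linearly separable toy problem: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (10, 2)), rng.normal(2, 0.5, (10, 2))])
y = np.array([-1.0] * 10 + [1.0] * 10)
b, alpha = lssvm_train(X, y)
preds = np.array([lssvm_predict(X, y, alpha, b, x) for x in X])
print((preds == y).mean())  # training accuracy
```

Because the system is a single (n+1)×(n+1) linear solve, training cost is dominated by that solve rather than by an iterative quadratic program.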
Thus, for a given kernel function K(·,·), the LS-SVM classifier is given by

    f(x) = sgn[ Σ_{i=1}^{n} α_i y_i K(x, x_i) + b ]    (12)

In our experiments we adopt the following normalized polynomial kernel, since it has achieved good performance in many pattern recognition tasks:

    K(x_i, x_j) = k(x_i, x_j) / sqrt( k(x_i, x_i) k(x_j, x_j) )    (13)

where k(·,·) is the polynomial kernel. As can be seen, the LS-SVM classifier is constructed by solving the linear system (11) rather than a quadratic program, which is computationally efficient.

2.3. The document classification algorithm description

The document classification procedure based on MMP and LS-SVM can be outlined as follows:

Step 1: SVD projection. To overcome the singularity problem, project the document set {x_i} into the SVD subspace by discarding the zero singular values.
Step 2: Construct the within-class and between-class graphs and choose the edge weights according to (3) and (4), respectively.
Step 3: Build the optimization objective of MMP according to (1) and (2).
Step 4: Obtain the projection vectors of MMP by solving the generalized eigenvalue problem (5), and compute the lower-dimensional feature representation of the high-dimensional document data according to (6).
Step 5: Construct the optimization objective of LS-SVM in the reduced lower-dimensional document feature space according to (8) and (9).
Step 6: Obtain the Lagrange multipliers α_i by solving the linear system (11), then apply the LS-SVM classifier (12) to classify the documents.

3. Experimental results
To test the effectiveness of our approach, a comprehensive performance study has been conducted on two standard text datasets: Reuters-21578 [4] and WebKB [5]. We compare our proposed document classification algorithm based on MMP and LS-SVM (MMP-LSSVM) with document classification algorithms based on LSI [6] and SVM (LSI-SVM), on LDA [7] and SVM (LDA-SVM), and on LPP [8] and SVM (LPP-SVM). In all experiments, we simply removed stop words; no further preprocessing was done. Each document is represented as a normalized term-frequency vector. To evaluate classification performance, we use two well-known metrics: the classification accuracy rate and the F1 measure [6]. In our experiments, we randomly chose 50% of the documents for training and assigned the rest for testing. We conducted the training-test procedure three times and report the average of the three runs as the final result.

The classification accuracy rates and F1 measures on Reuters-21578 and WebKB are reported in Table 1 and Table 2, respectively. From the experimental results, we can make the following observations: 1) Our proposed MMP-LSSVM algorithm consistently outperforms the LSI-SVM, LDA-SVM, and LPP-SVM algorithms. A possible reason is that MMP simultaneously considers the local manifold and discriminant structures, thus achieving maximum
discrimination; in addition, the computationally efficient LS-SVM algorithm further improves the classification performance. 2) The LSI-SVM algorithm performs the worst among the four classification algorithms, which indicates that LSI might not be optimal for classifying documents. 3) The LDA-SVM algorithm performs worse than the LPP-SVM algorithm. This is probably because LDA fails to discover the underlying nonlinear manifold structure, which is important for document classification.

Table 1. Performance comparison on the Reuters-21578 data set

Classification algorithm    Accuracy rate (%)    F1 measure (%)
LSI-SVM                     79.57                72.83
LDA-SVM                     88.96                85.49
LPP-SVM                     91.82                89.51
MMP-LSSVM                   95.64                92.67

Table 2. Performance comparison on the WebKB data set

Classification algorithm    Accuracy rate (%)    F1 measure (%)
LSI-SVM                     81.26                77.59
LDA-SVM                     87.43                83.72
LPP-SVM                     90.58                86.36
MMP-LSSVM                   94.37                91.85
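The two evaluation metrics used above follow their standard definitions; a minimal sketch on hypothetical toy labels (not data from the experiments):

```python
def accuracy(y_true, y_pred):
    """Fraction of documents assigned the correct class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical toy labels: tp=2, fp=1, fn=1.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))   # 4/6
print(f1_score(y_true, y_pred))   # P = R = 2/3, so F1 = 2/3
```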
4. Conclusions
This paper addresses the important problems of dimensionality reduction and classifier selection for document classification. A novel document classification algorithm based on MMP and LS-SVM is proposed. Experimental results demonstrate that the proposed algorithm achieves much better performance than traditional document classification algorithms.

References

[1] He Xiaofei, Cai Deng, Han Jiawei. Learning a maximum margin subspace for image retrieval. IEEE Transactions on Knowledge and Data Engineering 2008; 20(2): 189-201.
[2] Vapnik V. Statistical Learning Theory. New York: John Wiley & Sons; 1998.
[3] Suykens JAK, Vandewalle J. Least squares support vector machine classifiers. Neural Processing Letters 1999; 9(3): 293-300.
[4] Reuters-21578. http://www.daviddlewis.com/resources/testcollections/reuters21578/, 2004.
[5] Craven M, DiPasquo D, Freitag D, et al. Learning to extract symbolic knowledge from the world wide Web. Proceedings of the National Conference on Artificial Intelligence, Cambridge: MIT Press; 1998, p. 509-516.
[6] van Rijsbergen CJ. Information Retrieval. London: Butterworths; 1979.
[7] Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. New York: Wiley-Interscience; 2000.
[8] He X, Niyogi P. Locality preserving projections. Advances in Neural Information Processing Systems, Cambridge: MIT Press; 2003, p. 153-160.