A Knowledge-Based Feature Selection Method for Text Categorization

Yan Xu1,2, JinTao Li1, Bin Wang1, ChunMing Sun1,2

1 Institute of Computing Technology, Chinese Academy of Sciences, No.6 Kexueyuan South Road, Zhongguancun, Haidian District, Beijing, China
{xuyan, jtli, wangbin, sunchunming}@ict.ac.cn
2 Dept. of Computer Science, NCEPU (BJ), Beijing 102206
Abstract. A major difficulty of text categorization is the high dimensionality of the original feature space, so feature selection plays an important role in text categorization. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, and many existing experiments show that IG is one of the most effective methods. In this paper, a method is proposed to measure an attribute's importance based on Rough Set theory. According to Rough Set theory, knowledge about a universe of objects may be defined as classifications based on certain properties of the objects; that is, Rough Set theory assumes that knowledge is an ability to partition objects. We quantify this ability to partition objects, call the amount of this ability the knowledge quantity, and then put forward a knowledge-based feature selection method called KG. Experimental results on the NewsGroup and OHSUMED corpora show that KG performs much better than MI and DF, and even better than IG.
1 Introduction
Text categorization is the process of grouping texts into one or more predefined categories based on their content. Due to the increased availability of documents in digital form and the rapid growth of online information, text categorization has become one of the key techniques for handling and organizing text data. A major difficulty of text categorization is the high dimensionality of the original feature space. Consequently, feature selection, i.e. reducing the original feature space, has been widely proposed and carefully investigated. In recent years, a growing number of statistical classification methods and machine learning techniques have been applied in this field. Many feature selection methods such as document frequency thresholding (DF), the information gain measure (IG), and the mutual information measure (MI) have been widely used, and many existing experiments show that IG is one of the most effective methods [1][2][3]. It is well known that attributes in an information system are not equally important; but which attributes are important and which are unimportant, or even redundant? For measuring information, the field of Information Theory has its origin in Claude Shannon's 1948 paper "A Mathematical Theory of Communication". Written information has the property of reducing the uncertainty of a situation, so the measurement of information is the measurement of uncertainty. Rough Set theory, which is a very useful tool for describing vague and uncertain information, regards knowledge as an ability to partition objects. We quantify this ability to partition objects and call the amount of this ability the knowledge quantity. Based on the knowledge quantity, we propose a rough set feature selection method named the knowledge-gain method (KG). Experimental results on the NewsGroup and OHSUMED corpora show that KG performs much better than MI and DF, and even better than IG.
2 Feature selection methods and rough set theory
In this section we reexamine the feature selection methods DF, IG, and MI, which are commonly used for feature selection in text categorization. DF and IG both perform well for feature selection in TC, with IG reported to have the best performance in many experiments [1][2][3]. The following definitions of DF, IG, and MI are taken from [1].
2.1 Document frequency thresholding
Document frequency is the number of documents in which a term occurs. Only the terms that occur in a sufficiently high number of documents are retained. DF thresholding is the simplest technique for vocabulary reduction. It easily scales to very large corpora, with a computational complexity approximately linear in the number of training documents.
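As a minimal sketch of DF thresholding (the data representation, function names, and threshold value below are illustrative, not from the paper), assuming each document is given as the set of its unique terms:

```python
from collections import Counter

def df_select(docs_terms, min_df):
    """Keep only terms whose document frequency reaches min_df.

    docs_terms: list of sets, each set holding the unique terms of one document.
    """
    df = Counter()
    for terms in docs_terms:
        df.update(terms)                      # each term counted once per document
    return {t for t, n in df.items() if n >= min_df}

# Toy usage (hypothetical data): keep terms occurring in at least 2 documents.
docs = [{"rough", "set", "text"}, {"text", "category"}, {"rough", "text"}]
print(df_select(docs, min_df=2))              # {'rough', 'text'}
```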
2.2 Information gain
Information gain is commonly used as a term goodness criterion in machine learning [4][5]. It measures the amount of information obtained for category prediction by knowing the presence or absence of a term in a document. Let $\{c_i\}_{i=1}^{m}$ denote the set of categories in the target space. The information gain of term $t$ is defined to be:

$$G(t) = -\sum_{i=1}^{m} p(c_i)\log p(c_i) + p(t)\sum_{i=1}^{m} p(c_i \mid t)\log p(c_i \mid t) + p(\bar{t})\sum_{i=1}^{m} p(c_i \mid \bar{t})\log p(c_i \mid \bar{t})$$
Given a training corpus, for each unique term the information gain is computed and those terms whose information gain is less than some predetermined threshold are removed from the feature space.
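A minimal sketch of computing G(t) by simple counting (our own function and variable names, with probabilities estimated as relative frequencies over the training documents):

```python
import math

def information_gain(docs_terms, labels, term, categories):
    """G(t) per the formula above, with maximum-likelihood probability estimates."""
    n = len(docs_terms)
    has_t = [term in d for d in docs_terms]
    n_t = sum(has_t)
    n_not_t = n - n_t
    p_t, p_not_t = n_t / n, n_not_t / n

    def plogp(p):
        # convention: 0 * log 0 = 0
        return p * math.log(p) if p > 0 else 0.0

    g = 0.0
    for c in categories:
        p_c = sum(1 for y in labels if y == c) / n
        p_c_t = (sum(1 for h, y in zip(has_t, labels) if h and y == c) / n_t) if n_t else 0.0
        p_c_nt = (sum(1 for h, y in zip(has_t, labels) if (not h) and y == c) / n_not_t) if n_not_t else 0.0
        g += -plogp(p_c) + p_t * plogp(p_c_t) + p_not_t * plogp(p_c_nt)
    return g
```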
2.3 Mutual information
Mutual information is a criterion commonly used in statistical language modeling of word associations and related applications. Given a category c and a term t, the mutual information criterion between t and c is defined as:
$$I(t, c) = \log_2 \frac{p(t \wedge c)}{p(t) \times p(c)}$$
These category-specific scores of a term are then combined to measure the goodness of the term at a global level. Let $\{c_i\}_{i=1}^{m}$ denote the set of categories in the target space. Typically this can be done in two alternate ways:

$$I_{avg}(t) = \sum_{i=1}^{m} p(c_i)\, I(t, c_i), \qquad I_{max}(t) = \max_{i=1}^{m} \{ I(t, c_i) \}$$
After the computation of these criteria, thresholding is performed to achieve the desired degree of feature elimination from the full vocabulary of a document corpus.
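A sketch of the MI criterion and both combination schemes (our own names; a small epsilon is added as an assumption of ours to avoid log(0) for terms that never co-occur with a category):

```python
import math

def mutual_information(docs_terms, labels, term, category):
    """I(t, c) = log2( p(t AND c) / (p(t) * p(c)) ), estimated by counting."""
    n = len(docs_terms)
    eps = 1e-12                                        # smoothing assumption, not from the paper
    p_t = sum(term in d for d in docs_terms) / n
    p_c = sum(y == category for y in labels) / n
    p_tc = sum((term in d) and (y == category) for d, y in zip(docs_terms, labels)) / n
    return math.log2((p_tc + eps) / (p_t * p_c + eps))

def mi_scores(docs_terms, labels, term, categories):
    """Combine category-specific scores into I_avg and I_max."""
    n = len(labels)
    scores = {c: mutual_information(docs_terms, labels, term, c) for c in categories}
    i_avg = sum((sum(y == c for y in labels) / n) * s for c, s in scores.items())
    i_max = max(scores.values())
    return i_avg, i_max
```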
2.4 Basic concepts of rough set theory
Rough set theory, introduced by Zdzislaw Pawlak in 1982 [6][7][8], is a mathematical tool to deal with vagueness and uncertainty. At present it is widely applied in many fields, such as machine learning, knowledge acquisition, decision analysis, knowledge discovery from databases, expert systems, pattern recognition, etc. In this section, we introduce some basic concepts of rough set theory which used in this paper. Given two sets U and A, where U ={x1, ..., xn} is a nonempty finite set of objects called the universe, and A = {a1,…, ak} is a nonempty finite set of attributes, the attributes in A is further classified into two disjoint subsets, condition attribute set C and decision attribute set D, A=C∪D and C∩D = Φ. Each attribute a∈A, V is the domain of values of A, Va is the set of values of a, defining an information function fa, : U→Va, we call 4-tuple as an information system. a(x) denotes the value of attribute a for object x . Any subset B ⊆ A determines a binary relation Ind(B) on U, called indiscemibility relation: Ind(B)={ (x,y) ∈ U×U | ∀a∈B , a(x) = a(y) } The family of all equivalence classes of Ind(B), namely the partition determined by B, will be denoted by U/B . If ( x , y ) ∈ Ind(B), we will call that x and y are Bindiscernible .Equivalence classes of the relation Ind(B) are referred to as B elementary sets. The elementary sets are the basic blocks of our knowledge about reality, sometimes called as concepts. 2.5
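As a minimal illustration of these definitions (the table representation and helper names are ours, not from the paper), the indiscernibility relation Ind(B) and the partition U/B can be computed directly from an attribute-value table stored as a dictionary of rows:

```python
from collections import defaultdict

def partition(table, B):
    """U/B: the equivalence classes of Ind(B), i.e. objects that agree on every attribute in B."""
    classes = defaultdict(list)
    for x, row in table.items():
        classes[tuple(row[a] for a in B)].append(x)   # signature of x on the attributes in B
    return list(classes.values())

def b_indiscernible(table, x, y, B):
    """(x, y) is in Ind(B) iff a(x) = a(y) for every attribute a in B."""
    return all(table[x][a] == table[y][a] for a in B)
```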
2.5 Relevant work
Many text categorization methods have been presented, and some works are based on Rough Set theory. The paper [11] first locates a minimal set of coordinate keywords to distinguish between classes of documents, and then uses rough sets to reduce the dimensionality of the keyword vectors. The paper [10] proposes a hybrid technique using Latent Semantic Indexing (LSI) and Rough Set theory. The paper [9] introduces a hybrid method that first selects features using one of the standard feature selection methods, such as mutual information or information gain, and then further selects features using rough sets. In text categorization, each document is described by a vector of extremely high dimensionality, so most rough set-based methods combine rough sets with another technique, as in [9][10][11]. This paper does not use another technique; instead, following the Rough Set view that knowledge is an ability to partition objects, we quantify this ability, call the amount of this ability the knowledge quantity, and then put forward the knowledge-based feature selection method KG.
3 Knowledge measurement based on Rough Set

3.1 The ability to discern objects
An example of an information table is given in Table 1 [12]. The rows of Table 1, labeled E1, E2, ..., E8, are the elements (objects); the features are X and Y, where X = College Major and Y = Likes "Gladiator".

Table 1. An information table for text

      X         Y
E1    Math      Yes
E2    History   No
E3    CS        Yes
E4    Math      No
E5    Math      No
E6    CS        Yes
E7    History   No
E8    Math      Yes
The important concept in rough set theory is the indiscernibility relation. For example, in Table 1, (E1, E4) is X-indiscernible, while (E1, E2) is not X-indiscernible. In Table 1, X divides {E1, E2, ..., E8} into three equivalence classes {E1, E4, E5, E8}, {E2, E7}, and {E3, E6}. That is to say, X can discern E2, E7 from E1, E4, E5, E8. Similarly, Y can discern E1, E3, E6, E8 from E2, E4, E5, E7.
Now we quantify the ability of a feature or a set of features P to discern objects; we call the amount of this ability of discerning objects the knowledge quantity. When computing knowledge quantity, we take the following considerations into account:
- When each object is discernible from the others by the feature set P, P has the largest knowledge quantity;
- When all elements can only be divided into one equivalence class by P, that is to say, P cannot distinguish any object from the others, P has the smallest knowledge quantity.
3.2 Knowledge quantity
This section is discussed on an information table (let the decision feature set D = Φ).

Definition 1. The object domain set U is divided into m equivalence classes by a set P (some features of the information table), and the probability of elements in each equivalence class is p1, p2, ..., pm. Let WP denote the knowledge quantity of P, WP = W(p1, p2, ..., pm), which satisfies the following conditions:
1) if m = 1 then W(p1) = W(1) = 0
2) W(p1, ..., pi, ..., pj, ..., pm) = W(p1, ..., pj, ..., pi, ..., pm)
3) W(p1, p2, ..., pm) = W(p1, p2+...+pm) + W(p2, ..., pm)
4) W(p1, p2+p3) = W(p1, p2) + W(p1, p3)

This can be explained as follows. If some P cannot discern any object from another in the domain U, i.e. only one equivalence class is induced by P in the domain U, then the ability of P to discern objects is 0, i.e. the knowledge quantity of P is 0. If the domain U is divided into m equivalence classes by some feature set P, then for different orderings of the same equivalence classes the same feature set P should have the same ability of discerning objects, so W(p1, ..., pi, ..., pj, ..., pm) = W(p1, ..., pj, ..., pi, ..., pm). If the domain U is divided into m equivalence classes E1, E2, ..., Em by some feature set P, and the probability of elements in each equivalence class is p1, p2, ..., pm, then E1 can be discerned from E2∪E3∪...∪Em by P, and E2∪E3∪...∪Em can be divided into m-1 equivalence classes E2, E3, ..., Em by P, so W(p1, p2, ..., pm) = W(p1, p2+...+pm) + W(p2, ..., pm). If the domain U is divided into two equivalence classes E1 and E2 by some feature set P, and the probabilities of elements in E1 and E2 are p1 and p2+p3, then all elements in E1 can be discerned from the p2 elements in E2 and also from the other p3 elements in E2 by P, that is, W(p1, p2+p3) = W(p1, p2) + W(p1, p3).

Theorem 1. If the domain U is divided into m equivalence classes by some feature set P, and the number of elements in each equivalence class is n1, n2, ..., nm (so that pi = ni/n, where n is the number of elements in U), then the knowledge quantity of P is:

$$W(p_1, p_2, \ldots, p_m) = c \sum_{1 \le i < j \le m} p_i \times p_j$$

Here c is a constant parameter, $c = n^2\, W\!\left(\tfrac{1}{n}, \tfrac{1}{n}\right)$.
Proof. (omitted)

E.g., from Table 1 we estimate:

$$W_X = c \sum_{1 \le i < j \le 3} p_i \times p_j = 0.5 \times 0.25c + 0.5 \times 0.25c + 0.25 \times 0.25c = 0.3125c$$

$$W_Y = c \sum_{1 \le i < j \le 2} p_i \times p_j = 0.25c$$
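The following sketch (our own data structure and helper names; the constant c is kept as a parameter and set to 1.0) reproduces W_X and W_Y for Table 1 using Theorem 1:

```python
from collections import Counter
from itertools import combinations

# Table 1 as an attribute-value table.
TABLE1 = {
    "E1": {"X": "Math",    "Y": "Yes"},
    "E2": {"X": "History", "Y": "No"},
    "E3": {"X": "CS",      "Y": "Yes"},
    "E4": {"X": "Math",    "Y": "No"},
    "E5": {"X": "Math",    "Y": "No"},
    "E6": {"X": "CS",      "Y": "Yes"},
    "E7": {"X": "History", "Y": "No"},
    "E8": {"X": "Math",    "Y": "Yes"},
}

def knowledge_quantity(table, attrs, c=1.0):
    """W_P = c * sum_{i<j} p_i * p_j over the equivalence classes induced by attrs."""
    n = len(table)
    sizes = Counter(tuple(row[a] for a in attrs) for row in table.values())
    probs = [s / n for s in sizes.values()]
    return c * sum(pi * pj for pi, pj in combinations(probs, 2))

print(knowledge_quantity(TABLE1, ["X"]))   # 0.3125  (= W_X with c = 1)
print(knowledge_quantity(TABLE1, ["Y"]))   # 0.25    (= W_Y with c = 1)
```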
3.3 Specific conditional knowledge quantity
The measurement of information called entropy is the measurement of uncertainty. Based on entropy, the tutorial [12] derives a series of definitions, such as specific conditional entropy, conditional entropy, and information gain. We follow this path to obtain specific conditional knowledge quantity, conditional knowledge quantity, and knowledge gain.

Definition 2. The object domain is U, and P and D are two attribute (feature) sets. For a specific value v of P, the specific conditional knowledge quantity $W_{D/P=v}$ is defined as the knowledge quantity of D among only those records in which P has value v.

E.g., from Table 1 we estimate:

$$W_{Y/X=\mathrm{Math}} = c \sum_{1 \le i < j \le 2} p_i \times p_j = 0.5 \times 0.5c = 0.25c$$
$$W_{Y/X=\mathrm{History}} = 0$$
$$W_{Y/X=\mathrm{CS}} = 0$$
3.4 Conditional knowledge quantity
Definition 3. The object domain is U, and P and D are two attribute (feature) sets. With $v_j$ ranging over the specific values of P, the conditional knowledge quantity $W_{D/P}$ is defined as:

$$W_{D/P} = \sum_{j} \mathrm{prob}(P = v_j)\, W_{D/P = v_j}$$
E.g., from Table 1 we estimate:

Table 2.

vj        Prob(X = vj)   W_{Y/X=vj}
Math      0.5            0.25c
History   0.25           0
CS        0.25           0

$$W_{Y/X} = \sum_{j} \mathrm{prob}(X = v_j)\, W_{Y/X = v_j} = 0.5 \times 0.25c + 0.25 \times 0 + 0.25 \times 0 = 0.125c$$
3.5 Knowledge gain
Definition 4. The object domain is U, and P and D are two attribute (feature) sets. The knowledge gain KG(D | P) is defined as: $KG(D \mid P) = W_D - W_{D/P}$.

E.g., from Table 1 we estimate:

$$W_Y = 0.25c$$
$$W_{Y/X} = 0.125c$$
$$KG(Y \mid X) = W_Y - W_{Y/X} = 0.25c - 0.125c = 0.125c$$
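Continuing the Table 1 sketch above (it assumes the TABLE1 dictionary and the knowledge_quantity helper defined there), Definitions 3 and 4 can be computed as:

```python
from collections import Counter
from itertools import combinations

def conditional_kq(table, d_attrs, p_attrs, c=1.0):
    """W_{D/P} = sum_v prob(P = v) * W_{D/P=v}  (Definition 3)."""
    n = len(table)
    total = 0.0
    values = {tuple(row[a] for a in p_attrs) for row in table.values()}
    for v in values:
        subset = {x: row for x, row in table.items()
                  if tuple(row[a] for a in p_attrs) == v}
        total += (len(subset) / n) * knowledge_quantity(subset, d_attrs, c)
    return total

def knowledge_gain(table, d_attrs, p_attrs, c=1.0):
    """KG(D | P) = W_D - W_{D/P}  (Definition 4)."""
    return knowledge_quantity(table, d_attrs, c) - conditional_kq(table, d_attrs, p_attrs, c)

print(conditional_kq(TABLE1, ["Y"], ["X"]))    # 0.125  (= W_{Y/X} with c = 1)
print(knowledge_gain(TABLE1, ["Y"], ["X"]))    # 0.125  (= KG(Y|X) with c = 1, since W_Y = 0.25)
```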
4 A knowledge-based feature selection method
Knowledge gain measures the amount of knowledge obtained for category prediction by knowing the presence or absence of a term in a document. Let $\{c_i\}_{i=1}^{m}$ denote the set of categories in the target space. The knowledge gain of term $t$ is defined to be:

$$KG(t) = KG(C \mid T) = c \sum_{1 \le i < j \le m} p(c_i) \times p(c_j) - c \left( p(t) \sum_{1 \le i < j \le m} p(c_i \mid t)\, p(c_j \mid t) + p(\bar{t}) \sum_{1 \le i < j \le m} p(c_i \mid \bar{t})\, p(c_j \mid \bar{t}) \right)$$
Given a training corpus, for each unique term the knowledge gain is computed, and those terms whose knowledge gain is less than some predetermined threshold are removed from the feature space. We call this method the Knowledge-Gain method (KG).
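A sketch of this selection procedure (our own function and variable names), assuming binary presence/absence of each term and probabilities estimated by counting over the training documents:

```python
from itertools import combinations

def kg_score(docs_terms, labels, term, categories, c=1.0):
    """KG(t) per Section 4: W_C minus the term-conditioned knowledge quantity."""
    n = len(docs_terms)

    def pair_sum(doc_idx):
        """c * sum_{i<j} p(c_i | subset) p(c_j | subset) over the given documents."""
        doc_idx = list(doc_idx)
        if not doc_idx:
            return 0.0
        probs = [sum(1 for k in doc_idx if labels[k] == ci) / len(doc_idx)
                 for ci in categories]
        return c * sum(pi * pj for pi, pj in combinations(probs, 2))

    with_t = [k for k in range(n) if term in docs_terms[k]]
    without_t = [k for k in range(n) if term not in docs_terms[k]]
    w_c = pair_sum(range(n))                         # knowledge quantity of the categories
    w_c_given_t = (len(with_t) / n) * pair_sum(with_t) + (len(without_t) / n) * pair_sum(without_t)
    return w_c - w_c_given_t

def kg_select(docs_terms, labels, vocabulary, categories, threshold):
    """Keep terms whose knowledge gain reaches the threshold."""
    return {t for t in vocabulary
            if kg_score(docs_terms, labels, t, categories) >= threshold}
```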
5 Experiment results
Many feature selection methods such as DF, IG, and MI have been widely used in text categorization, and existing experiments show that IG is one of the most effective methods. Our objective is to compare the DF, IG, and MI methods with the KG method. A number of statistical classification and machine learning techniques have been applied to text categorization; we use two different classifiers, the k-nearest-neighbor classifier (kNN) and the Naïve Bayes classifier. We use kNN because it is one of the top-performing classifiers [13], with evaluations showing that it outperforms nearly all other systems, and we selected Naïve Bayes because it is also one of the most efficient and effective inductive learning algorithms for classification [14]. According to [15], micro-averaged precision is widely used in cross-method comparisons, so we adopt it here to evaluate the performance of the different feature selection methods.
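The paper does not describe its implementation; as one possible realization of this evaluation setup, a scikit-learn-based sketch (bag-of-words counts restricted to the selected terms, kNN with an assumed k of 30, multinomial Naïve Bayes, micro-averaged precision) might look like the following:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score

def evaluate(train_texts, train_labels, test_texts, test_labels, selected_terms):
    """Train kNN and Naïve Bayes on the selected feature set; report micro-averaged precision."""
    vec = CountVectorizer(vocabulary=sorted(selected_terms))   # restrict features to selected terms
    x_train = vec.fit_transform(train_texts)
    x_test = vec.transform(test_texts)
    results = {}
    for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=30)),   # k is an assumption
                      ("NaiveBayes", MultinomialNB())]:
        clf.fit(x_train, train_labels)
        pred = clf.predict(x_test)
        results[name] = precision_score(test_labels, pred, average="micro")
    return results
```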
5.1 Data Collections
Two corpora are used in our experiments: the NewsGroup collection [17] and the OHSUMED collection [1][16]. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. The 20 Newsgroups collection is a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. In this experiment, after removing the unrelated data, we use 5769 documents as the training set and 3837 documents as the test set. There are 31109 unique terms in the training set and 10 categories in this document collection. OHSUMED is a bibliographical document collection. There are about 1800 categories defined in MeSH, and 14321 categories are present in the OHSUMED document collection. We used a subset of this document collection: 7445 documents as the training set and 3729 documents as the test set. There are 11465 unique terms in the training set and 10 categories in this document collection.
5.2 Results
Fig. 1. Average precision of kNN vs. number of selected features on NewsGroup (panels 1-a and 1-b).

Fig. 2. Average precision of Naïve Bayes vs. number of selected features on NewsGroup (panels 2-a and 2-b).
Figure 1 and Figure 2 exhibit the performance curves of kNN and Naïve Bayes on NewsGroup after feature selection with DF, IG, MI, and KG. We can note that KG and IG are the most effective in our experiments; in contrast, MI had relatively poor performance. In particular, KG performs better than the IG method, and under extremely aggressive reduction it is notable that KG outperforms IG ((1-a), (2-a)).
Fig. 3. Average precision of kNN vs. number of selected features on OHSUMED (panels 3-a and 3-b).

Fig. 4. Average precision of Naïve Bayes vs. number of selected features on OHSUMED (panels 4-a and 4-b).
Figure 3 and Figure 4 exhibit the performance curves of kNN and Naïve Bayes on OHSUMED after feature selection with DF, IG, MI, and KG. We can also note that KG and IG are the most effective in our experiments; in contrast, MI had relatively poor performance. In particular, KG performs better than IG, and under extremely aggressive reduction it is notable that KG consistently outperforms IG ((3-a), (4-a)).
6 Conclusion
Feature selection plays an important role in text categorization. Many feature selection methods such as DF, IG, and MI have been widely used, and many existing experiments show that IG is one of the most effective methods. In this paper:
- According to Rough Set theory, we give an interpretation of knowledge and knowledge quantity.
- We put forward a knowledge-based feature selection method called KG. Experimental results on the NewsGroup and OHSUMED corpora show that KG performs much better than MI and DF, and even better than IG. In particular, under extremely aggressive reduction, it is notable that KG outperforms IG.
- Many text categorization methods have been presented, and some works are based on Rough Set theory. In text categorization, each document is described by a vector of extremely high dimensionality, so most rough set-based methods use rough sets for feature selection together with another technique. This paper does not use another technique; only according to the Rough Set view that knowledge is an ability to partition objects, we quantify this ability, call the amount of this ability the knowledge quantity, and then put forward the knowledge-based feature selection method KG.
References

1. Yiming Yang, Jan O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, pp. 412-420, 1997.
2. Ying Liu. A Comparative Study on Feature Selection Methods for Drug Discovery. J. Chem. Inf. Comput. Sci., 44, 1823-1828, 2004.
3. Stewart M. Yang, Xiao-Bin Wu, Zhi-Hong Deng, Ming Zhang, Dong-Qing Yang. Modification of Feature Selection Methods Using Relative Term Frequency. Proceedings of ICMLC-2002, pp. 1432-1436, 2002.
4. J.R. Quinlan. Induction of Decision Trees. Machine Learning, 1(1): pp. 81-106, 1986.
5. Tom Mitchell. Machine Learning. McGraw Hill, 1996.
6. Pawlak, Z. Rough Sets. International Journal of Computer and Information Science, 11(5): 341-356, 1982.
7. Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A. Rough Sets: A Tutorial. In: A New Trend in Decision-Making, Springer-Verlag, Singapore, 3-98, 1999.
8. Pawlak, Z., Grzymala-Busse, J., Nelson, D.E., et al. Rough Sets. Communications of the ACM, 38(11): 89-95, 1995.
9. A Rough Set-Based Hybrid Feature Selection Method for Topic-Specific Text Filtering. Proceedings of the Third International Conference on Machine Learning and Cybernetics, August 2004.
10. A Rough Set-Based Hybrid Method to Text Categorization. WISE (1) 2001: 254-261.
11. Chouchoulas, A., Shen, Q. A Rough Set-Based Approach to Text Classification. Proceedings of the 7th International Workshop on Rough Sets (Lecture Notes in Artificial Intelligence, No. 1711), pp. 118-127, 1999.
12. Andrew Moore. Statistical Data Mining Tutorials. http://www.autonlab.org/tutorials/
13. Yiming Yang, Xin Liu. A Re-examination of Text Categorization Methods. Proceedings of SIGIR'99, pp. 42-49, 1999.
14. H. Zhang. The Optimality of Naive Bayes. The 17th International FLAIRS Conference, Miami Beach, May 17-19, 2004.
15. Yiming Yang. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, pp. 67-88, 1999.
16. OHSUMED. http://www.cs.umn.edu/%CB%9Chan/data/tmdata.tar.gz
17. NewsGroup. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html