International Journal of Computer Applications (0975 – 8887) Volume 83 – No.17, December 2013
An Efficient Feature Selection Method for Arabic Text Classification

Bilal Hawashin
Department of Computer Information Systems, Alzaytoonah University of Jordan, Amman 11733, Jordan

Ayman M. Mansour
Department of Electrical and Computer Engineering, Tafila Technical University, Tafila 66110, Jordan

Shadi Aljawarneh
Department of Software Engineering, Al-Isra University, Amman 11622, Jordan
ABSTRACT
This paper proposes an efficient, Chi-square-based feature selection method for Arabic text classification. In data mining, feature selection is a preprocessing step that can improve classification performance. Although a few works have studied the effect of feature selection methods on Arabic text classification, only a limited number of methods were compared, and different works used different datasets. This paper improves on the previous works in three aspects. First, it proposes a new, efficient feature selection method for enhancing Arabic text classification. Second, it compares an extended number of existing feature selection methods. Third, it adopts two publicly available datasets and encourages future works to adopt them in order to guarantee fair comparisons among the various works. Our experiments show that the proposed method outperformed the existing methods in terms of accuracy.
Keywords
Data Mining, Arabic Text Retrieval, Feature Selection, Chi-square.
1. INTRODUCTION
Text classification is a data mining application that automatically assigns one or more predefined labels to free text items based on their content [9]. The amount of text data available on the web is increasing daily, and this huge volume makes classifying it manually a very difficult and time-consuming task. Therefore, automatic text classification has been introduced. Text classification is used in many fields, such as email filtering, digital libraries, online databases, and online news. Although many works have studied the classification of English texts, few works have studied the classification of Arabic texts. For example, [13] studied the performance of the C5.0 and Support Vector Machine classifiers on Arabic texts, where the latter outperformed the former with accuracies of 0.78 and 0.69, respectively. [12] evaluated Naïve Bayes for classifying Arabic web texts, achieving an accuracy of 0.68. [4] investigated the performance of CBA, Naïve Bayes, and SVM in classifying Arabic texts; the results showed that CBA outperformed NB and SVM with an accuracy of 0.8. [10] compared the performance of SVM and KNN on Arabic texts, where SVM outperformed KNN. Text items are represented using a term-document matrix, in which every row represents a term and every column represents a text item. The original number of terms can be huge, which can negatively affect classification performance. Therefore, one of the important preprocessing steps in text classification, and in data mining applications generally, is feature selection.
In this step, only the important terms are selected, which reduces space consumption and can improve classification accuracy by eliminating noisy terms. Few works have studied the effect of feature selection on Arabic text classification. For example, [4] studied the effect of the maximum entropy method on classifying Arabic texts, achieving an accuracy of 0.80. [2] showed that an SVM classifier combined with Chi-square-based feature selection is an appropriate method for classifying Arabic texts. [5] evaluated the effect of N-gram frequency statistics on classifying Arabic texts. [14] compared TF.IDF, DF, LSI, Stemming, and Light Stemming, and showed that the former three methods outperformed the latter two stemming methods.

In most of the previous works, a limited number of methods was used. Besides, different works used different datasets, which makes comparing their methods difficult. Furthermore, the sizes of the used datasets were rather small, which could affect the experimental results. In this paper, the previous works are extended by comparing more feature selection methods, and two publicly available datasets are used in order to make the different works comparable. Furthermore, an improved Chi-square-based method is proposed and its effect on Arabic text classification performance is analyzed. The proposed method is compared with the regular Chi-square statistic [16], Information Gain [15], Mean TF.IDF [12], DF [16], the Wrapper Approach with an SVM classifier [11], Feature Subset Selection [7], and a Chi-square variant. Most of these methods have strong theoretical foundations and have proved their superiority in feature selection for English texts. In order to evaluate their performance, two publicly available datasets, the Akhbar Alkhalij and Alwatan datasets [1], were used. An SVM classifier is used to classify the texts after the feature selection process.

The contributions of this work are as follows. First, a new improved Chi-square-based method is proposed. Second, previous works are extended by comparing more existing feature selection methods according to their performance in classifying Arabic texts. Third, the use of two publicly available datasets is adopted in an attempt to make different works comparable.

The rest of this paper is organized as follows. The Compared Feature Selection Methods section describes the various existing feature selection methods to be compared; the Improved Chi-square-Based Feature Selection Method section presents our proposed method; the Comparing Regular and Improved Chi-square Methods section covers phase one of the experimental part, which compares this method with the regular Chi-square method and another Chi-square variant; the Comparing Our Method with Existing Feature Selection Methods section covers phase two of the experimental part, which compares our method with various existing feature selection methods according to their effect on classifying Arabic texts; and the Conclusion section concludes the paper.
2. COMPARED FEATURE SELECTION METHODS
The various existing feature selection methods that will be compared with our method in the experimental part are described in this section. In this context, the words feature and term are used interchangeably.
2.1 Chi-square
Chi-square is a well-known statistical measure that has been used in feature selection [16]. This method assigns a numerical value to each term that appears at least once in any document. The value of a term w is calculated as follows:
Val(w) = \frac{(n_{pt+}\, n_{nt-} - n_{pt-}\, n_{nt+})^2}{(n_{pt+} + n_{pt-})(n_{nt+} + n_{nt-})(n_{pt+} + n_{nt+})(n_{pt-} + n_{nt-})}    (1)
where n_{pt+} and n_{nt+} are the numbers of text documents in the positive category and the negative category, respectively, in which term w appears at least once. The positive and negative categories are used to find the accuracy measurements per class when multiple classes are used, such that the positive category indicates a class and the negative category indicates the remaining classes. n_{pt-} and n_{nt-} are the numbers of text documents in the positive category and the negative category, respectively, in which the term w does not occur. The value of each term represents its importance; the terms with the highest values are the most important terms.
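To make equation (1) concrete, here is a minimal NumPy sketch that scores every term against one positive class. The function name chi_square_scores and the binary term-presence matrix td are illustrative assumptions of this sketch, not part of the paper.

```python
import numpy as np

def chi_square_scores(td, labels, positive_class):
    """Chi-square score (equation 1) of every term for one class.

    td: (documents x terms) binary matrix; td[d, t] = 1 if term t
        appears at least once in document d.
    labels: one class label per document (NumPy array).
    positive_class: class treated as positive; the remaining
        classes together form the negative category.
    """
    pos = labels == positive_class
    neg = ~pos
    n_pt_pos = td[pos].sum(axis=0)       # positive docs containing each term
    n_nt_pos = td[neg].sum(axis=0)       # negative docs containing each term
    n_pt_neg = pos.sum() - n_pt_pos      # positive docs without the term
    n_nt_neg = neg.sum() - n_nt_pos      # negative docs without the term

    num = (n_pt_pos * n_nt_neg - n_pt_neg * n_nt_pos) ** 2
    den = ((n_pt_pos + n_pt_neg) * (n_nt_pos + n_nt_neg)
           * (n_pt_pos + n_nt_pos) * (n_pt_neg + n_nt_neg))
    return num / np.maximum(den, 1)      # guard against division by zero
```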
2.2 Mean TF.IDF
According to this method [12], a term-document matrix is constructed for the training set, and the TF.IDF weighting method is used to weight each term in each training document. The TF.IDF of the term w in document d is calculated as follows:

TF.IDF(w,d) = \log(tf_{w,d} + 1) \cdot \log(idf_w),    (2)

where tf_{w,d} is the frequency of the term w in document d, idf_w is N / n_w, N is the number of training documents, and n_w is the number of training documents that contain the term w. Later, the Mean TF.IDF is calculated for each term using the following equation:

Val(w) = \frac{\sum_{d} TF.IDF(w,d)}{Count(d)},    (3)

where Count(d) is the total number of documents in the dataset. The features with the higher Mean TF.IDF values are selected.
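A minimal sketch of equations (2) and (3) follows, assuming a raw term-frequency matrix tf with one row per training document; the function name mean_tfidf_scores is our own.

```python
import numpy as np

def mean_tfidf_scores(tf):
    """Mean TF.IDF score per term (equations 2 and 3).

    tf: (documents x terms) matrix of raw term frequencies.
    """
    n_docs = tf.shape[0]                     # N
    df = (tf > 0).sum(axis=0)                # n_w: docs containing each term
    idf = n_docs / np.maximum(df, 1)         # idf_w = N / n_w
    tfidf = np.log(tf + 1) * np.log(idf)     # equation (2), per cell
    return tfidf.sum(axis=0) / n_docs        # equation (3): mean over documents
```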
2.3 Document Frequency (DF)
Here, each term is valued according to the number of documents that contain it. The more documents contain a term, the larger its DF value, and the greater its importance:

DF(w) = n_w,    (4)

where n_w is the number of training documents that contain the term w.
2.4 Information Gain (IG)
Information Gain [15] is a probability-based feature selection method that uses the following formula:

IG = H(Class) - H(Class|Feature),    (5)

where

H(Class) = -\sum_{Class_i \in Class} P(Class_i) \log P(Class_i)    (6)

and

H(Class|Feature) = -\sum_{Ft_i \in Feature} P(Ft_i) \sum_{Class_i \in Class} P(Class_i|Ft_i) \log(P(Class_i|Ft_i)),    (7)

where P(Class_i) is the probability of class i, P(Ft_i) is the probability of feature i, and P(Class_i|Ft_i) is the probability of class i given feature i.
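A minimal sketch of equations (5) to (7) for a single binary (present/absent) feature is given below; the helper names entropy and information_gain are assumptions of this sketch.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def information_gain(feature, labels):
    """IG of one binary feature (equation 5).

    feature: 0/1 presence vector, one entry per document.
    labels: one class label per document.
    """
    _, counts = np.unique(labels, return_counts=True)
    h_class = entropy(counts / len(labels))            # equation (6)

    h_cond = 0.0
    for value in (0, 1):                               # the two feature outcomes
        mask = feature == value
        if not mask.any():
            continue
        p_ft = mask.mean()                             # P(Ft_i)
        _, sub = np.unique(labels[mask], return_counts=True)
        h_cond += p_ft * entropy(sub / mask.sum())     # inner sum of equation (7)
    return h_class - h_cond                            # equation (5)
```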
2.5 Feature Subset Selection (FSS)
This method [3] evaluates the importance of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred.
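The following greedy sketch is a simplification of that idea under our own assumptions (absolute Pearson correlation as the relevance and redundancy measure, numerically encoded class labels); it is not the exact merit formula of [3].

```python
import numpy as np

def abs_corr(a, b):
    """Absolute Pearson correlation; 0 for constant vectors."""
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return abs(np.corrcoef(a, b)[0, 1])

def greedy_feature_subset(X, y, k):
    """Greedily build a subset of k features that are highly
    correlated with the class but weakly intercorrelated."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        def merit(f):
            relevance = abs_corr(X[:, f], y)
            redundancy = (np.mean([abs_corr(X[:, f], X[:, s]) for s in selected])
                          if selected else 0.0)
            return relevance - redundancy   # reward relevance, punish redundancy
        best = max(remaining, key=merit)
        selected.append(best)
        remaining.remove(best)
    return selected
```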
2.6 Wrapper Approach
This method [8] evaluates feature sets by using a learning method: cross-validation is used to estimate the accuracy of the learning scheme for a given set of attributes. This method can improve the classification accuracy, but at the cost of a significant increase in feature selection time.
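As an illustration, here is a minimal greedy forward-selection sketch of the wrapper idea using scikit-learn (our own tooling choice; the paper does not specify an implementation), with a linear SVM as the wrapped learner:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def wrapper_forward_selection(X, y, k, cv=5):
    """Greedy forward selection: repeatedly add the feature whose
    inclusion gives the best cross-validated SVM accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        def cv_accuracy(f):
            cols = selected + [f]
            return cross_val_score(LinearSVC(), X[:, cols], y, cv=cv).mean()
        best = max(remaining, key=cv_accuracy)
        selected.append(best)
        remaining.remove(best)
    return selected  # one CV run per candidate per step, hence the high cost
```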
3. IMPROVED CHI-SQUARE-BASED FEATURE SELECTION METHOD
This section presents our improved Chi-square-based feature selection method.
Algorithm 1: Chi-square-based Equal Class Feature Selection Method
Input: a term-document matrix TD representing the training set of D documents, T terms, and C classes; the requested number of reduced features R.
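The body of Algorithm 1 continues beyond this excerpt, so the following Python sketch is only one plausible reading of its name and inputs: chi-square scores are computed per class (reusing the chi_square_scores sketch from Section 2.1), and the R-feature budget is split equally among the C classes. Every detail below is our assumption, not the paper's specification.

```python
import numpy as np

def equal_class_feature_selection(td, labels, R):
    """Hypothetical reading of Algorithm 1: give each class an equal
    share of the R selected features, ranked by per-class chi-square.
    Reuses chi_square_scores() from the Section 2.1 sketch.
    """
    classes = np.unique(labels)
    per_class = R // len(classes)                 # equal share per class
    selected = []
    for c in classes:
        scores = chi_square_scores(td, labels, positive_class=c)
        taken = 0
        for t in np.argsort(scores)[::-1]:        # highest-scoring terms first
            if t not in selected:                 # skip terms another class took
                selected.append(t)
                taken += 1
            if taken == per_class:
                break
    return selected
```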