New stemming for Arabic text classification using feature selection and decision trees
Said Bahassine
Mohamed Kissi
Abdellah Madani
EMMID, Department of Computer Chouaib Doukkali University Faculty of Science B.P. 20, 24000 El Jadida, Morocco
[email protected]
EMMID, Department of Computer Chouaib Doukkali University Faculty of Science B.P. 20, 24000 El Jadida, Morocco
[email protected]
MATIC, Department of Computer Chouaib Doukkali University Faculty of Science B.P. 20, 24000 El Jadida, Morocco
[email protected]
The remainder of the paper is organized as follows: section 2 discusses previous works in Arabic text classification. Related works in Arabic stem root extraction and our new algorithm are presented in section 3. Experimental results are presented in section 4, and then we draw some conclusions and provide suggestions for future research.
Abstract—In this paper we conduct a comparative study of two stemming algorithms for Arabic text classification (categorization): the Khoja stemmer and our new stemmer. We use chi-square statistics for feature selection and focus on the decision tree classifier. The evaluation used a corpus of 5070 documents independently classified into six categories (sport, entertainment, business, middle east, switch and world) and was carried out with the WEKA toolkit. The recall measure is used to compare the performance of the two methods. Results show that text classification using our new stemmer outperforms classification using the Khoja stemmer.
II. RELATED WORKS
A wide variety of studies have addressed the problem of text classification. Most of these works were performed on English and French texts [3], but few have been applied to Arabic text.
Keywords—Arabic Text classification; Stemming; Decision tree; Chi-square;
Al-Shalabi implemented a text classification system for the Arabic language [4]. The system uses unigrams and bigrams as the feature extraction method in the preprocessing step, Term Frequency-Inverse Document Frequency (TFIDF) as the feature selection method, and the K-Nearest Neighbors (KNN) model for classification.
I. INTRODUCTION
With the fast growth of online Arabic documents, there is a growing need for text classification in order to make it easier to search for information on the web and to find the relevant items in it. Text classification is defined as the assignment of an unclassified text document to one or more predefined categories based on its content. Text classification has been used in digital library systems, spam filtering, document management systems, web page classification, sentiment analysis for marketing and classification of email messages [1].
Mesleh implemented the Support Vector Machines (SVM) algorithm using chi-square as the feature selection method in the preprocessing step [5]. He evaluated the performance of his classifier on an in-house corpus collected from online Arabic newspaper archives, including Aljazeera, Al Nahar, Al Hayat, Al Ahram and AlDostor, as well as a few other specialized websites. The collected corpus contains 1445 documents that vary in length and fall into nine categories. Mesleh concluded that the SVM algorithm with the chi-square method outperforms the Naïve Bayesian and KNN classifiers in terms of F-measure.
Arabic is the fifth most widely used language in the world. It is spoken by more than 422 million people as a first language and by 54 million native speakers. It is also the language of the Koran: 1.6 billion believers practice their prayers in that language [2]. Its alphabet contains 28 letters, called Huruf Alhijaa: 25 consonants and three long vowels. They are written from right to left and change shape according to their position in the word.
Al-Harbi et al. evaluated the performance of two popular classification algorithms, the C5.0 decision tree and SVM, on classifying Arabic text using seven Arabic corpora: Saudi News Agency, Saudi News Paper, website, writers, Discussions, Islamic topic and Arabic Poems. They implemented a tool for Arabic text classification that performs feature extraction and selection [6]. The experimental results showed that the C5.0 algorithm outperformed the SVM classifier in terms of accuracy by about
Arabic has a rich morphology, a complex syntax, complex semantics and very complex grammatical rules, which distinguish it from other languages and make its learning, analysis and automatic processing difficult. In this paper, we compare and contrast the impact of two stemming methods, the Khoja stemmer and our own stemmer, on text classification with a decision tree, using chi-square statistics for feature selection.
Khoja’s algorithm is one of the most widely used morphological stemming algorithms [13]. It removes the longest suffix and the longest prefix from the word, and then extracts the root by comparing the rest of the word with its verbal and noun patterns.
10%: the average accuracy of SVM is 68.65%, while that of C5.0 is 78.42%. The Naïve Bayes algorithm was used by El-Kourdi et al. for Arabic text classification [7]. They used a corpus of 1500 text documents collected from the Al Jazeera website and belonging to 5 categories: Sport, Business, Culture and Art, Science, and Health; each category contains 300 text documents. All words in the documents were converted to their roots. The results showed that the average accuracy was about 68.78% in cross validation and 62% in evaluation set experiments. The best accuracy by category was 92.8%.
Sawalha conducted a comparative study of three stemming algorithms: Khoja’s stemmer, Buckwalter’s morphological analyser and the Al-Shalabi algorithm [14]. The results showed that the Khoja stemmer performs best in terms of accuracy. We therefore examine this stemmer further and compare our results against it.
Abu-Errub classified Arabic text documents using the TFIDF measure: a document is compared with predefined document categories based on its content, and is then assigned to the appropriate sub-category using the chi-square measure [8]. Abu-Errub evaluated the performance of his algorithm using 1090 testing documents categorized into ten main categories and 50 subcategories. The results showed that the best accuracy by category was 98.93%.
Houssien et al. compared three classification algorithms on Arabic text [9]: Sequential Minimal Optimization (SMO), Naïve Bayesian (NB) and J48 (C4.5), using Weka. The collected corpus contained 2363 documents that vary in length and fall into six categories: Sport, Economic, medicine, politic, religion and science. The authors used stop-word elimination and normalization to reduce the number of features extracted from the documents. Recall, precision and error rate were used to compare the classifiers. The results indicate that the SMO classifier achieves the highest accuracy and the lowest error rate, followed by J48 (C4.5) and then the NB classifier, whereas the J48 classifier took the longest time to produce its results, followed by the NB classifier and then the SMO classifier.
The Khoja stemmer follows this procedure:
1) Remove diacritics.
2) Remove stop words, punctuation and numbers.
3) Remove the definite article (ال, وال, فال, كال, بال).
4) Remove the conjunction (و).
5) Remove suffixes.
6) Remove prefixes.
7) Compare the result against a list of patterns. If a match is found, extract the root.
8) Check the match against the predefined root base.
9) Replace: ا, و, ي by و.
10) Replace: ؤ, ئ by أ.
11) If the root contains only two letters, check whether they should contain a double character.
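The first steps of this procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Khoja's implementation: the function names, the Unicode range used for diacritics and the `min_len` guard are all hypothetical choices made for the example.

```python
import re

# Arabic tanween and short-vowel marks (U+064B to U+0652); an assumed
# definition of "diacritics" for this sketch.
DIACRITICS = re.compile("[\u064B-\u0652]")

# Definite-article forms from step 3, three-letter forms first so that
# "وال" is tried before the bare "ال".
ARTICLES = ["وال", "فال", "كال", "بال", "ال"]


def remove_diacritics(word):
    """Step 1: drop diacritic marks, keeping the base letters."""
    return DIACRITICS.sub("", word)


def remove_article(word, min_len=3):
    """Step 3: strip a leading definite article.

    min_len is an assumed guard: the article is only removed when at
    least min_len letters remain afterwards.
    """
    for article in ARTICLES:
        if word.startswith(article) and len(word) - len(article) >= min_len:
            return word[len(article):]
    return word
```

With these assumptions, `remove_article("والدان")` strips the leading "وال" and returns "دان", which shows how removing an apparent article without further checks can discard a letter that belongs to the root.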
Most previous work viewed text as a Bag-Of-Tokens (BOT). In Arabic, words have multiple morphological structures, for example "درس" (study), "مدرس" (teacher), "مدرسة" (school) and "دراسة" (study). In most cases these variants have similar semantic features and belong to the same category, yet under BOT we use 4 attributes ("درس", "مدرس", "مدرسة", "دراسة") instead of 1 ("درس"). To overcome this shortcoming we apply stemming before classification, which reduces the number of attributes. The accuracy obtained with stemmed terms is better than with BOT for all preprocessing classifications [10].
III. STEMMING ALGORITHM
Stemming is the process of reducing inflected words to their root form by removing prefixes, suffixes and infixes. There are several types of stemming algorithms [11]: statistical, dictionary, morphological and light stemming.
B. A new stemmer approach:
Although Khoja’s algorithm had the highest accuracy, it still suffers from several weaknesses [15]. For instance, the Khoja stemmer removes definite articles, conjunctions, prefixes or suffixes that can occasionally be part of the word’s root: the word "والدان" (parents) is stemmed to "دون" instead of "ولد" (son).
TABLE I. AFFIXES USED IN OUR ALGORITHM
Affixes in Arabic: Examples
Length 1 prefixes: ا, ن, ت, ي, و, س, ف, ب, ل
Length 2 prefixes: وي, وب, وا, با, لي, ال, يت, ين, ست, سي, ال, فل, ول, او, ون, لل, وس, وت
Length 3 prefixes: ولل, است, تست, يست, سيت, اال, الت, لال, بال, كال, للت, وال, فسي
Length 4 prefixes: واست, وتست, وسيت, واال, المت, ولال, وبال, وكال, وفسي, وللت
Length 1 suffixes: ن, ا, ت, ك, ي, ه, ة
Length 2 suffixes: وه, كه, ها, يا, نا, هن, كم, يه, ية, تي, تن, ين, ان, ات, ون, هم, ما, وا, ني, ته, كن, تم
Length 3 suffixes: وني, ونا, وها, وهم, كمل, يها, هما, تين, تان, تنا, همل, تها, اتي, اتك, اته, يين, كما, يات
Length 4 suffixes: اتيه, اتهن, اتها, اتهم, اتكن, اتنا, تهما
Our stemmer tries to overcome this shortcoming. To that end, the proposed algorithm differs from Khoja’s in the following respects:
A. Previous works:
One of the morphological stemming algorithms is the tri-literal root extraction algorithm, also referred to as the Al-Shalabi algorithm [12]. This algorithm assigns a weight to each letter of a word and multiplies it by the letter's position in the word. Consonants were assigned a weight of zero, and different weights were assigned to the letters grouped in the word (سألتمونيها). All affix letters are distinguished by their weights.
It verifies whether the affixes are part of the word before removing them.
It uses more affixes, as shown in Table I.
It uses an enriched stop words file that can increase the accuracy in text categorization [16].
When the word "والدان" is first entered (see algorithm), we compute its length, n = 6. If the length is less than 4, we look the word up in the sets M and V. Otherwise, we look for the pattern of the word; here we do not find one. We then apply the (n-1)-gram, which yields "والدا" and "الدان".
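The candidate words produced at each step are the contiguous substrings of a given length. A minimal sketch of this step follows; the function name `substrings` is hypothetical and is not taken from the paper.

```python
def substrings(word, k):
    """Return every contiguous substring of `word` that has length k,
    scanning from the first character to the last."""
    return [word[i:i + k] for i in range(len(word) - k + 1)]
```

For the six-letter word "والدان", `substrings("والدان", 5)` gives the two five-letter candidates "والدا" and "الدان", and `substrings("والدان", 4)` gives the three four-letter candidates "والد", "الدا" and "لدان". The order in which the candidates are examined is a choice of this sketch.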
We process the first word, "الدان". We test whether the first character of the word "والدان" is a prefix. If it is, we search for the root of the word "الدان". Its length is 5, so we search Pattern(5) for the patterns that correspond to the word, and we find three:
Pattern "فعلان": the root is "الد"; it is not found in the sets M and V.
Pattern "افعلل": the root is "لدان"; it is not found in the sets M and V.
Pattern "افعال": the root is "لدان"; it is not found in the sets M and V.
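Matching a word against a pattern can be sketched as follows, assuming the usual convention that the letters ف, ع and ل in a pattern stand for root letters and every other pattern letter must appear literally in the word. The function name `match_pattern` and this exact matching rule are assumptions of the sketch, not a description of the authors' code.

```python
ROOT_SLOTS = "فعل"  # pattern letters that stand for root letters (assumed convention)


def match_pattern(word, pattern):
    """Return the root letters of `word` under `pattern`, or None on mismatch."""
    if len(word) != len(pattern):
        return None
    root = []
    for letter, slot in zip(word, pattern):
        if slot in ROOT_SLOTS:
            root.append(letter)   # this position holds a root letter
        elif letter != slot:
            return None           # a literal pattern letter does not match
    return "".join(root)
```

Under this rule, `match_pattern("الدان", "افعلل")` returns "لدان", while `match_pattern("والدا", "افعلل")` returns None because the word does not start with the literal "ا" required by the pattern.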
Algorithm:
Pattern(i) = set of the patterns of length i
P(i) = set of the prefixes of length i
S(i) = set of the suffixes of length i
V = set of the roots of the verbs
We then process the second word, "والدا", and test whether the last character removed from the word "والدان" is a suffix. If it is, we search for the root of the word "والدا". Its length is 5. We look in Pattern(5) for a pattern that corresponds to the word, but we do not find one. We repeat the same steps with the (n-2)-gram (4-gram), which yields three words: "والد", "لدان" and "الدا".
M = set of the roots of the words
R = set of the roots (results)
F = input file
F ← Cleanup(F)
// Remove diacritics, stop words, punctuation and numbers
L ← ToList(F)
For the first word, "لدان", we test whether the first two characters of the word "والدان" form a prefix. If they do, we search for the root of the word "لدان". Its length is 4, so we search Pattern(4) for the pattern that corresponds to the word. We find "فعال", which gives the root "لدن"; it is not found in the sets M and V.
For the second word, "الدا", we test whether the first character of the word "والدان" is a prefix and whether the last is a suffix. If so, we search for the root of the word "الدا". Its length is 4; we search Pattern(4) for a pattern that corresponds to the word, but we do not find one.
For the third word, "والد", we test whether the last two characters of the word "والدان" form a suffix. If they do, we search for the root of the word "والد". Its length is 4; we look in Pattern(4) for the pattern that corresponds to the word and find "فاعل", which gives the root "ولد". This root is found in the Arabic word set M.
In general, we would continue with the (n-3)-gram, and so on. We stop at the (n-6)-gram, provided the length of the remaining word stays greater than or equal to 3. In our case the length of the word is 6, so we stop at the (n-3)-gram.
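The whole walk-through can be condensed into the following sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the left-to-right order in which candidates are tried, and the tiny affix, pattern and root sets in the usage example are all hypothetical.

```python
ROOT_SLOTS = "فعل"  # pattern letters that stand for root letters (assumed convention)


def match_pattern(word, pattern):
    """Return the root letters of `word` under `pattern`, or None on mismatch."""
    if len(word) != len(pattern):
        return None
    root = []
    for letter, slot in zip(word, pattern):
        if slot in ROOT_SLOTS:
            root.append(letter)
        elif letter != slot:
            return None
    return "".join(root)


def find_root(word, patterns, prefixes, suffixes, known_roots, min_len=3):
    """Try ever shorter substrings of `word` until a known root is found.

    patterns maps a length i to the patterns of that length (Pattern(i));
    known_roots plays the role of the union of the sets M and V.
    """
    n = len(word)
    if word in known_roots:
        return word
    for k in range(n, min_len - 1, -1):            # k = n, n-1, ..., min_len
        for start in range(n - k + 1):
            removed_prefix = word[:start]
            candidate = word[start:start + k]
            removed_suffix = word[start + k:]
            # Only accept the cut if what was removed is a listed affix.
            if removed_prefix and removed_prefix not in prefixes:
                continue
            if removed_suffix and removed_suffix not in suffixes:
                continue
            if len(candidate) < 4:
                # Short candidates are looked up directly in the root sets.
                if candidate in known_roots:
                    return candidate
                continue
            for pattern in patterns.get(k, ()):
                root = match_pattern(candidate, pattern)
                if root is not None and root in known_roots:
                    return root
    return None


# Toy data for the example word; real sets would be far larger.
PATTERNS = {5: ["افعلل"], 4: ["فعال", "فاعل"]}
PREFIXES = {"و", "وا"}
SUFFIXES = {"ن", "ان"}
ROOTS = {"ولد"}
```

With this toy data, `find_root("والدان", PATTERNS, PREFIXES, SUFFIXES, ROOTS)` rejects the five-letter candidates, then reaches "والد" after removing the suffix "ان", matches it against "فاعل" and returns "ولد".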
// Convert the text to a list of words, splitting on spaces
for Mot in L do
If Mot ϵ L do
n ← length(Mot)
t ← n
if Mot ϵ V or Mot ϵ M return Mot
while ( t-n=2)
i ← 0
while(i3 and n