New stemming for arabic text classification using ... - Semantic Scholar

New stemming for Arabic text classification using feature selection and decision trees

Said Bahassine

Mohamed Kissi

Abdellah Madani

EMMID, Department of Computer Chouaib Doukkali University Faculty of Science B.P. 20, 24000 El Jadida, Morocco [email protected]

EMMID, Department of Computer Chouaib Doukkali University Faculty of Science B.P. 20, 24000 El Jadida, Morocco [email protected]

MATIC, Department of Computer Chouaib Doukkali University Faculty of Science B.P. 20, 24000 El Jadida, Morocco [email protected]

The remainder of the paper is organized as follows: section 2 discusses previous works in Arabic text classification. Related works in Arabic stem root extraction and our new algorithm are presented in section 3. Experimental results are presented in section 4, and then we draw some conclusions and provide suggestions for future research.

Abstract—In this paper we conduct a comparative study of two stemming algorithms for Arabic text classification (categorization), the Khoja stemmer and our new stemmer, using chi-square statistics for feature selection and a decision tree classifier. The evaluation used a corpus of 5070 documents independently classified into six categories (sport, entertainment, business, middle east, switch and world) and was run on the WEKA toolkit. The recall measure is used to compare the performance of the two methods. The results show that text classification using our new stemmer outperforms classification using the Khoja stemmer.

II. RELATED WORKS

A wide variety of studies have addressed the problem of text classification. Most of these works were performed on English and French texts [3], but few have been applied to Arabic text.

Keywords—Arabic text classification; Stemming; Decision tree; Chi-square

Al-Shalabi implemented a text classification system for the Arabic language [4]. The system uses unigrams and bigrams for feature extraction in the preprocessing step, Term Frequency-Inverse Document Frequency (TFIDF) to select among these features, and a K-Nearest Neighbors (KNN) model for classification.
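As a rough illustration of the unigram/bigram and TFIDF steps mentioned above, the sketch below computes TFIDF weights over word unigrams and bigrams. It assumes whitespace tokenization and the common weighting tf × log(N/df); the cited system may use a different variant, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the word n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features(text):
    """Unigrams followed by bigrams of a whitespace-tokenized text."""
    tokens = text.split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)

def tfidf(docs):
    """Map each document to {feature: tf * idf}, with idf = log(N / df)."""
    counts = [Counter(features(d)) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each document counts once per feature
    n_docs = len(docs)
    return [
        {f: (tf / sum(c.values())) * math.log(n_docs / df[f]) for f, tf in c.items()}
        for c in counts
    ]

docs = ["a b a", "a c"]  # toy documents, not data from the paper
weights = tfidf(docs)
```

A feature that occurs in every document (here "a") gets weight zero, which is why TFIDF can serve as a crude selection criterion.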

I. INTRODUCTION

With the fast growth of online Arabic documents, there is a growing need for text classification to make it easier to search for information on the web and to find relevant content. Text classification is defined as the assignment of an unclassified text document to one or more appropriate predefined categories based on its content. Text classification has been used in digital library systems, spam filtering, document management systems, web page classification, sentiment analysis for marketing and classification of email messages [1].

Mesleh implemented the Support Vector Machines (SVM) algorithm using chi-square as a feature selection method in the preprocessing step [5]. He evaluated his classifier on an in-house corpus collected from online Arabic newspaper archives, including Aljazeera, Al Nahar, Al Hayat, Al Ahram and AlDostor, as well as a few other specialized websites. The collected corpus contains 1445 documents that vary in length and fall into nine categories. Mesleh concluded that the SVM algorithm with the chi-square method outperforms the Naïve Bayesian and KNN classifiers in terms of F-measure.

Arabic is the fifth most widely used language in the world. It is spoken by more than 422 million people as a first language and by 54 million native speakers. It is also the language of the Koran: there are 1.6 billion believers who practice their prayers in that language [2]. Its alphabet contains 28 letters, called Huruf Alhijaa: 25 consonants and three long vowels. They are written from right to left and change shape according to their position in the word.

Al-Harbi et al. evaluated the performance of two popular classification algorithms, the C5.0 decision tree and SVM, on seven Arabic corpora: Saudi News Agency, Saudi News Paper, website, writers, Discussions, Islamic topic and Arabic Poems. They implemented a tool for Arabic text classification that performs feature extraction and selection [6]. The experimental results showed that the C5.0 algorithm outperformed the SVM classifier in terms of accuracy by about

Arabic has a rich morphology, a complex syntax, complex semantics and very complex grammatical rules, which distinguish it from other languages and make its learning, analysis and automatic processing difficult. In this paper, we compare the impact of two stemming methods, the Khoja stemmer and our proposed stemmer, on text classification with a decision tree, using chi-square statistics for feature selection.
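To make the chi-square feature selection step concrete, here is a minimal sketch that scores each term against one category from a 2×2 contingency table. It assumes the standard form used in text categorization, χ²(t, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), where A and B count documents containing the term inside and outside the category, and C and D count documents without the term inside and outside the category. The paper does not spell out its exact formula, and the toy documents below are invented for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term/category pair from its 2x2 contingency counts."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate (always/never present)
    return n * (a * d - c * b) ** 2 / denom

def chi_square_scores(docs, labels, category):
    """Score every term; docs is a list of term sets, labels the document categories."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        pairs = list(zip(docs, labels))
        a = sum(1 for doc, y in pairs if term in doc and y == category)
        b = sum(1 for doc, y in pairs if term in doc and y != category)
        c = sum(1 for doc, y in pairs if term not in doc and y == category)
        d = sum(1 for doc, y in pairs if term not in doc and y != category)
        scores[term] = chi_square(a, b, c, d)
    return scores

# Toy data (hypothetical): two "sport" and two "business" documents.
docs = [{"goal", "match"}, {"goal", "team"}, {"market", "bank"}, {"bank", "match"}]
labels = ["sport", "sport", "business", "business"]
scores = chi_square_scores(docs, labels, "sport")
```

Terms that occur equally often inside and outside the category (here "match") score zero and would be dropped first when only the top-scoring features are kept.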


Khoja’s algorithm is one of the most widely used morphological stemming algorithms [13]. It removes the longest suffix and the longest prefix from the word, and then extracts the root by comparing the rest of the word with its verbal and noun patterns.

10%: the average accuracy of SVM is 68.65%, while that of C5.0 is 78.42%. The Naïve Bayes algorithm was used by El-Kourdi et al. for Arabic text classification [7]. They used a corpus of 1500 text documents collected from the Al Jazeera website and belonging to five categories: Sport, Business, Culture and Art, Science, and Health; each category contains 300 text documents. All words in the documents were converted to their roots. The results showed that the average accuracy was about 68.78% in cross validation and 62% in evaluation set experiments. The best accuracy by category was 92.8%.

Sawalha conducted a comparative study of three stemming algorithms: Khoja’s stemmer, Buckwalter’s morphological analyser and the Al-Shalabi algorithm [14]. The results showed that the Khoja stemmer performs best in terms of accuracy. Therefore, we examine this stemmer further and compare our results against it.

Abu-Errub classified Arabic text documents using the TFIDF measure: a document is first compared with pre-defined document categories based on its content, and is then assigned to the appropriate sub-category using the chi-square measure [8]. He evaluated his algorithm on 1090 test documents categorized into ten main categories and 50 sub-categories. The results showed that the best accuracy by category was 98.93%.

Houssien et al. compared three classification algorithms on Arabic text [9]: Sequential Minimal Optimization (SMO), Naïve Bayesian (NB) and J48 (C4.5), using Weka. The collected corpus contained 2363 documents that vary in length and fall into six categories: sport, economic, medicine, politic, religion and science. The authors used stop-word elimination and normalization to reduce the number of features extracted from the documents. Recall, precision and error rate were used to compare the classifiers. The results indicate that the SMO classifier achieves the highest accuracy and the lowest error rate, followed by J48 (C4.5) and then the NB classifier, whereas J48 took the longest time to produce results, followed by NB and then SMO.

The Khoja stemmer follows this procedure:
1) Remove diacritics
2) Remove stop words, punctuation and numbers
3) Remove definite article (ال, وال, فال, كال, بال)
4) Remove conjunction (و)
5) Remove suffixes
6) Remove prefixes
7) Compare result against a list of patterns. If a match is found, extract the root.
8) Check the match against the predefined root base.
9) Replace: ا, و, ي by و
10) Replace: ؤ, ئ by أ
11) If the root contains only two letters, check if they should contain a double character
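The first steps of the procedure above can be sketched as follows. This is a simplified illustration, not Khoja's implementation: the stop-word list is a tiny hypothetical sample, the diacritic range (U+064B to U+0652) and the rule of keeping at least three letters after removing an article are assumptions, and the suffix, prefix and pattern steps are omitted.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")    # fathatan .. sukun (assumed range)
NON_ARABIC = re.compile("[^\u0621-\u064A]")   # punctuation, digits, Latin, marks
DEFINITE_ARTICLES = ["وال", "فال", "كال", "بال", "ال"]  # longest forms first
STOP_WORDS = {"في", "من", "على"}              # tiny illustrative list (assumption)

def strip_diacritics(word):
    """Step 1: remove short-vowel marks."""
    return DIACRITICS.sub("", word)

def strip_article(word):
    """Step 3: remove a leading definite article if at least 3 letters remain."""
    for article in DEFINITE_ARTICLES:
        if word.startswith(article) and len(word) - len(article) >= 3:
            return word[len(article):]
    return word

def preprocess(text):
    """Steps 1-3 on a whitespace-tokenized text; returns the remaining tokens."""
    tokens = []
    for raw in text.split():
        word = NON_ARABIC.sub("", strip_diacritics(raw))
        if not word or word in STOP_WORDS:
            continue  # step 2: drop stop words, punctuation and numbers
        tokens.append(strip_article(word))
    return tokens
```

Note that `strip_article("والدان")` cuts the word down to "دان", because its first three letters look like the article "وال" even though they belong to the word itself; blind affix removal of this kind damages such words.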

Most of the previous work viewed text as a Bag-Of-Token (BOT). In the Arabic language, words have multiple morphological structures (for example, "درس" (study), "مدرس" (teacher), "مدرسة" (school) and "دراسة" (study)). In most cases, these variants have similar semantic features and belong to the same category, yet without stemming we use four attributes ("درس", "مدرس", "مدرسة", "دراسة") instead of one ("درس"). To overcome this shortcoming we use stemming before classification. The use of stemming reduces the number of attributes. The accuracy of stemmed terms is better than BOT for all preprocessing classifications [10].

III. STEMMING ALGORITHM

Stemming is the process of reducing inflected words to their root form. It removes prefixes, suffixes and infixes. There are several types of stemming algorithms [11]: statistical, dictionary, morphological and light stemming.

B. A new stemmer approach: Although Khoja’s algorithm had the highest accuracy, it still suffers from several weaknesses [15]. For instance, the Khoja stemmer removes definite articles, conjunctions, prefixes or suffixes that can occasionally be part of the word’s root: the word «والدان» (parents) is stemmed to «دون» instead of «ولد» (son).

TABLE I. AFFIXES USED IN OUR ALGORITHM

Affixes in Arabic: Examples
Length 1 prefixes: ل, ب, ف, س, و, ي, ت, ن, ا
Length 2 prefixes: ون, او, ول, فل, ال, سي, ست, ين, يت, ال, لي, با, وا, وب, وي, وت, وس, لل
Length 3 prefixes: كال, بال, لال, الت, اال, سيت, يست, تست, است, ولل, فسي, وال, للت
Length 4 prefixes: وكال, وبال, ولال, المت, واال, وسيت, وتست, واست, وفسي, وللت
Length 1 suffixes: ة, ه, ي, ك, ت, ا, ن
Length 2 suffixes: ون, ات, ان, ين, تن, تي, ية, يه, كم, هن, نا, يا, ها, كه, وه, تم, كن, ته, ني, وا, ما, هم
Length 3 suffixes: تها, همل, تنا, تان, تين, هما, يها, كمل, وهم, وها, ونا, وني, يات, كما, يين, اته, اتك, اتي
Length 4 suffixes: تهما, اتنا, اتكن, اتهم, اتها, اتهن, اتيه

A. Previous works: One of the morphological stemming algorithms is the tri-literal root extraction algorithm, also referred to as the Al-Shalabi algorithm [12]. This algorithm assigns a weight to each letter of a word and multiplies it by the letter's position in the word. Consonants were assigned a weight of zero and different weights were assigned to the letters grouped in the word (سألتمونيها). All affix letters are distinguished by their weights.

Our stemmer tries to overcome this shortcoming of the Khoja stemmer. For this reason, the proposed algorithm differs from Khoja’s in the following ways:
1) It verifies whether the affixes are part of the word before removing them.
2) It uses more affixes, as shown in Table I.
3) It uses an enriched stop words file that can increase the accuracy in text categorization [16].
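One way to picture the affix handling listed above is to enumerate every way of removing k letters from a word such that the removed leading part is a listed prefix and the removed trailing part is a listed suffix, leaving the decision about which candidate is a real stem to a later check. The sketch below does this with a small subset of the affixes from Table I; the function name and data layout are hypothetical, not taken from the paper.

```python
# Subset of Table I, keyed by affix length (illustrative, not the full table).
PREFIXES = {
    1: {"ل", "ب", "ف", "س", "و", "ي", "ت", "ن", "ا"},
    2: {"ال", "ول", "وا", "لل"},
}
SUFFIXES = {
    1: {"ة", "ه", "ي", "ك", "ت", "ا", "ن"},
    2: {"ون", "ات", "ان", "ين"},
}

def candidate_stems(word, removed):
    """Return (prefix, stem, suffix) splits that drop `removed` letters in total.

    A split is kept only if the dropped leading letters form a known prefix
    and the dropped trailing letters form a known suffix.
    """
    candidates = []
    for p in range(removed + 1):          # p letters from the front
        s = removed - p                   # s letters from the back
        prefix = word[:p]
        suffix = word[len(word) - s:]
        stem = word[p:len(word) - s]
        if p and prefix not in PREFIXES.get(p, set()):
            continue
        if s and suffix not in SUFFIXES.get(s, set()):
            continue
        candidates.append((prefix, stem, suffix))
    return candidates

one_removed = [stem for _, stem, _ in candidate_stems("والدان", 1)]
two_removed = [stem for _, stem, _ in candidate_stems("والدان", 2)]
```

For "والدان", removing one letter yields "والدا" and "الدان", and removing two letters yields "والد", "الدا" and "لدان", so the original word's letters are never discarded without an alternative candidate being kept.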

Algorithm:
Pattern(i) = set of the patterns of length i
P(i) = set of the prefixes of length i
S(i) = set of the suffixes of length i
V = set of the roots of the verbs
M = set of the roots of the words
R = set of the roots (results)
F = input file
F ← Cleanup(F) // Remove diacritics, stop words, punctuation and numbers
L ← ToList(F) // Convert text to list using space as split
for Mot in L do
If Mot ϵ L do
n ← length(Mot)
t ← n
if Mot ϵ V or Mot ϵ M return Mot
while ( t-n=2)
i ← 0
while(i3 and n

At the first entry of the word "والدان" (see algorithm), we compute the length of the word, n = 6. If the length were less than 4, we would look the word up in the sets M and V. Since it is not, we look for the pattern of the word, but we do not find it. We then apply the (n-1)-gram; the results are "والدا" and "الدان".

We process the first word, "الدان". We test whether the first character of the word "والدان" is a prefix. If yes, we search for the root of the word "الدان". The length of the word is 5, so we search for the pattern that corresponds to the word in Pattern(5) and find three patterns:
Pattern "فعالن": the root is "الد"; it is not found in the sets M and V.
Pattern "افعلل": the root is "لدان"; it is not found in the sets M and V.
Pattern "افعال": the root is "لدان"; it is not found in the sets M and V.

We process the second word, "والدا", and test whether the last character removed from the word "والدان" is a suffix. If yes, we search for the root of the word "والدا". Its length is 5. We look for the pattern that corresponds to the word in Pattern(5), but we do not find it. We repeat the procedure, this time applying the (n-2)-gram (4-gram), and obtain three words: "والد", "لدان" and "الدا".

We test whether the first two characters of the word "والدان" form a prefix. If yes, we search for the root of the word "لدان", whose length is 4. We search for the pattern that corresponds to the word in Pattern(4) and find "فعال"; the root is "لدن", which is not found in the sets M and V. For the second word, "الدا", we test whether the first character of the word "والدان" is a prefix and the last is a suffix. If yes, we search for the root of the word "الدا", whose length is 4; we search for the pattern that corresponds to the word in Pattern(4), but we do not find it. For the third word, "والد", we test whether the last two characters of the word "والدان" form a suffix. If yes, we search for the root of the word "والد", whose length is 4; we look for the pattern that corresponds to the word in Pattern(4) and find "فاعل"; the root is "ولد", which is found in the Arabic word set M. From there, the procedure would continue with the (n-3)-gram and so on; it stops at the (n-6)-gram at the latest, and the remaining word must keep a length of at least 3. In our case the length of the word is 6, so we stop at the (n-3)-gram.
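The pattern lookup used in the worked example for "والدان" can be sketched as follows. The sketch assumes that the letters ف, ع and ل in a pattern stand for root letters and that every other pattern letter must match the word literally; the two patterns are the length-4 ones quoted in the example, and `KNOWN_ROOTS` is a one-element stand-in for the sets M and V. Names and data are hypothetical, and the truncated algorithm listing is not reconstructed here.

```python
ROOT_SLOTS = "فعل"  # placeholder letters that mark root positions in a pattern

def match_pattern(word, pattern):
    """Return the root extracted from `word` if it fits `pattern`, else None."""
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in ROOT_SLOTS:
            root.append(w)      # slot position: keep the word's letter
        elif w != p:
            return None         # literal pattern letter does not match
    return "".join(root)

PATTERNS = {4: ["فاعل", "فعال"]}  # two length-4 patterns quoted in the text
KNOWN_ROOTS = {"ولد"}             # stand-in for the sets M and V (assumption)

def extract_root(candidates):
    """Return the first root that matches a pattern and is a known root."""
    for candidate in candidates:
        for pattern in PATTERNS.get(len(candidate), []):
            root = match_pattern(candidate, pattern)
            if root in KNOWN_ROOTS:
                return root
    return None

root = extract_root(["والد", "لدان", "الدا"])
```

With these inputs, "لدان" fits the pattern "فعال" but its root "لدن" is rejected because it is not a known root, "الدا" fits neither pattern, and "والد" fits "فاعل" and yields the accepted root "ولد", mirroring the outcome of the worked example.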
