The Effect of Preprocessing on Arabic Document Categorization

Algorithms 2016, 9, 27; doi:10.3390/a9020027

Abdullah Ayedh 1, Guanzheng Tan 1,*, Khaled Alwesabi 1 and Hamdi Rajeh 2

1 School of Information Science and Engineering, Central South University, Changsha 410000, China; abdullah_ayedh@csu.edu.cn (A.A.); k_alwesabi@csu.edu.cn (K.A.)
2 College of Computer Science and Electrical Engineering, Hunan University, Changsha 410000, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-139-7318-8535

Academic Editor: Tom Burr
Received: 21 January 2016; Accepted: 12 April 2016; Published: 18 April 2016

Abstract: Preprocessing is one of the main components in a conventional document categorization (DC) framework. This paper aims to highlight the effect of preprocessing tasks on the efficiency of an Arabic DC system. In this study, three classification techniques are used, namely, naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM). Experimental analysis on Arabic datasets reveals that preprocessing techniques have a significant impact on classification accuracy, especially given the complicated morphological structure of the Arabic language. Choosing appropriate combinations of preprocessing tasks provides a significant improvement in the accuracy of document categorization, depending on the feature size and the classification technique. The findings of this study show that the SVM technique outperformed the KNN and NB techniques, achieving a micro-F1 value of 96.74% with the combination of normalization and stemming as preprocessing tasks.

Keywords: document categorization; text preprocessing; stemming techniques; classification techniques; Arabic language processing

1. Introduction

Given the tremendous growth of available Arabic text documents on the Web and in databases, researchers are highly challenged to find better ways to deal with such a huge amount of information; these methods should enable search engines and information retrieval systems to provide relevant information accurately, which has become a crucial task in satisfying the needs of different end users [1]. Document categorization (DC) is one of the significant domains of text mining that is responsible for understanding, recognizing, and organizing various types of textual data collections. DC aims to classify natural language documents into a fixed number of predefined categories based on their contents [2]. Each document may belong to more than one category. Currently, DC is widely used in different domains, such as mail spam filtering, article indexing, Web searching, and Web page categorization [3]. These applications are increasingly important in today's information-oriented society.

A document in a text categorization system must pass through a set of stages. The first stage is preprocessing, which consists of document conversion, tokenization, normalization, stop word removal, and stemming tasks. The next stage is document modeling, which usually includes tasks such as vector space model construction, feature selection, and feature weighting. The final stage is DC, wherein documents are divided into training and testing data. In this stage, the training process uses the proposed classification algorithm to obtain a classification model that is then evaluated by means of the testing data.


The preprocessing stage is challenging and can affect the performance of any DC system positively or negatively. Therefore, improving the preprocessing stage for a highly inflected language such as Arabic will enhance the efficiency and accuracy of the Arabic DC system. This paper investigates the impact of preprocessing tasks, including normalization, stop word removal, and stemming, on improving the accuracy of an Arabic DC system.

The rest of this paper is organized as follows. Related work is presented in Section 2, and Section 3 provides an overview of the Arabic language structure. An Arabic DC framework is described in Section 4. Experiments and results are presented in Section 5. Finally, conclusions and future work are provided in Section 6.

2. Related Work

DC is a highly valued research area, wherein several approaches have been proposed to improve the accuracy and performance of document categorization methods. This section summarizes what has been achieved in document categorization across the literature.

For the English language, the impact of preprocessing on document categorization was investigated by Song et al. [4]. They concluded that the influence of stop-word removal and stemming is small; however, they suggested applying stop-word removal and stemming in order to reduce the dimensionality of the feature space and promote the efficiency of the document categorization system. Toman et al. [5] examined the impact of word normalization (stemming or lemmatization) and stop-word removal on English and Czech datasets. They concluded that stop-word removal improved the classification accuracy in most cases, whereas the impact of word normalization was negative; they suggested that applying stop-word removal while ignoring word normalization can be the best choice for document categorization. Furthermore, the impact of preprocessing tasks including stop-word removal and stemming on the English language was studied on trimmed versions of Reuters 21,578, Newsgroups, and Springer by Pomikálek et al. [6]. They concluded that stemming and stop-word removal have very little impact on the overall classification results. Uysal et al. [7] examined the impact of preprocessing tasks such as tokenization, stop-word removal, lowercase conversion, and stemming on the performance of the document categorization system. In their study, all possible combinations of preprocessing tasks were evaluated on two different domains, namely e-mail and news, and in two different languages, namely Turkish and English. Their experiments showed that the impact of stemming, tokenization, and lowercase conversion was overall positive on the classification accuracy; on the contrary, the impact of stop-word removal was negative. The use of stop-word removal and stemming in spam email filtering was analyzed by Méndez et al. [8]. They concluded that the performance of SVM is surprisingly better without stemming and stop-word removal; moreover, some stop words are rare in spam messages and should not be removed from the feature list in spite of being semantically void. Chirawichitchai et al.
[9] introduced a Thai document categorization framework focused on the comparison of several feature representation schemes, including Boolean, Term Frequency (TF), Term Frequency inverse Document Frequency (TFiDF), Term Frequency Collection (TFC), Length Term Collection (LTC), Entropy, and Term Frequency Relevance Frequency (TF-RF). Tokenization, stop word removal, and stemming were used as preprocessing tasks, with chi-square as the feature selection method. Three classification techniques were used, namely, naive Bayes (NB), Decision Tree (DT), and support vector machine (SVM). The authors claimed that TF-RF weighting with the SVM classifier yielded the best performance, with an F-measure of 95.9%.

Mesleh [10,11] studied the impact of six commonly used feature selection techniques based on support vector machines (SVMs) on Arabic DC. Normalization and stop word removal were used as preprocessing tasks in his experiments. He claimed that his experiments proved that the stemming technique is not always effective for Arabic document categorization. His experiments


showed that chi-square, the Ng-Goh-Low (NGL) coefficient, and the Galavotti-Sebastiani-Simi (GSS) coefficient significantly outperformed the other techniques with SVMs. Al-Shargabi et al. [12] studied the impact of stop word removal on Arabic document categorization using different algorithms, and concluded that SVM with sequential minimal optimization achieved the highest accuracy and the lowest error rate. Al-Shammari and Lin [13] introduced a novel stemmer called the Educated Text Stemmer (ETS) for stemming Arabic documents. The ETS stemmer was evaluated by comparison with human-generated stemming and with the stemming weight technique (the Khoja stemmer). The authors concluded that the ETS stemmer is more efficient than the Khoja stemmer. The study also showed that disregarding Arabic stop words can be highly effective, providing a significant improvement in processing Arabic documents.

Kanan [14] developed a new Arabic light stemmer called P-stemmer, a modified version of one of Larkey's light stemmers. He showed that his approach to stemming significantly enhanced the results for Arabic document categorization when using naive Bayes (NB), SVM, and random forest classifiers. His experiments showed that SVM performed better than the other two classifiers. Duwairi et al. [15] studied the impact of stemming on Arabic document categorization. In their study, two stemming approaches, namely, light stemming and word clusters, were used to investigate the impact of stemming. They reported that the light stemming approach improved the accuracy of the classifier more than the other approach. Khorsheed and Al-Thubaity [16] investigated classification techniques with a large and diverse dataset. These techniques include a wide range of classification algorithms, term selection methods, and representation schemes. For preprocessing, normalization and stop word removal were used. For term selection, their best average result was achieved using the GSS method with term frequency (TF) as the base for calculations. For term weighting functions, their study concluded that length term collection (LTC) was the best performer, followed by Boolean and term frequency collection (TFC). Their experiments also showed that the SVM classifier outperformed the other algorithms.

Ababneh et al. [17] discussed different variations of the vector space model (VSM) to classify Arabic documents using the k-nearest neighbor (KNN) algorithm; these variations are the Cosine, Dice, and Jaccard coefficients, using the inverse document frequency (IDF) term weighting method for comparison purposes. Normalization and stop word removal were used as preprocessing tasks in their experiment. Their experimental results showed that the Cosine coefficient outperformed the Dice and Jaccard coefficients. Zaki et al. [18] developed a hybrid system for Arabic document categorization based on the semantic vicinity of terms and the use of radial basis modeling. Normalization, stop word removal, and stemming were used as preprocessing tasks. They adopted a hybridization of N-gram + TFiDF statistical measures to calculate the similarity between words. By comparing the obtained results, they found that the use of radial basis functions improved the performance of the system.
Most of the previous studies on Arabic DC have proposed the use of preprocessing tasks to reduce the dimensionality of feature vectors without comprehensively examining their contribution to the effectiveness of the DC system, which makes this study unique in addressing the impact of preprocessing tasks on the performance of Arabic classification systems. To clarify the differences between the present work and the previous studies, the investigated preprocessing tasks and experimental settings are comparatively presented in Table 1. In this table, normalization, stop word removal, and stemming are abbreviated as NR, SR, and ST, respectively. The experimental settings include feature selection (FS in the table), dataset language, and classification algorithms (CA in the table).

Table 1. Comparison of the characteristics of this study with those of the previous studies.

Study | NR | SR | ST | FS | Language | CA
Uysal et al. [7] | ✓ | ✓ | ✓ | - | Turkish and English | SVM
Pomikálek et al. [6] | - | ✓ | ✓ | - | English | KNN, NB, SVM, NN, etc.
Méndez et al. [8] | - | ✓ | ✓ | - | English | NB, SVM, Adaboost, etc.
Song et al. [4] | - | ✓ | ✓ | - | English | SVM
Toman et al. [5] | - | ✓ | ✓ | - | English and Czech | NB
Chirawichitchai et al. [9] | - | ✓ | ✓ | ✓ | Thai | NB, DT, SVM
Mesleh [10,11] | ✓ | ✓ | - | ✓ | Arabic | SVM
Duwairi et al. [15] | - | - | ✓ | - | Arabic | KNN
Kanan [14] | - | - | ✓ | - | Arabic | SVM, NB, RF
Zaki et al. [18] | ✓ | ✓ | ✓ | - | Arabic | KNN
Al-Shargabi et al. [12] | - | ✓ | - | - | Arabic | NB, SVM, J48
Khorsheed et al. [16] | ✓ | ✓ | - | ✓ | Arabic | KNN, NB, SVM, etc.
Ababneh et al. [17] | ✓ | ✓ | - | - | Arabic | KNN
The proposed study | ✓ | ✓ | ✓ | ✓ | Arabic | NB, KNN, SVM

3. Overview of Arabic Language Structure

The Arabic language is a Semitic language with a complicated morphology, which is significantly different from the most popular languages, such as English, Spanish, French, and Chinese. The Arabic language is the native language of the Arab states and a secondary language in a number of other countries [19]. More than 422 million people are able to speak Arabic, which makes this language the fifth most spoken language in the world, according to [14]. The alphabet of the Arabic language consists of 28 letters:

أ - ب - ت - ث - ج - ح - خ - د - ذ - ر - ز - س - ش - ص - ض - ط - ظ - ع - غ - ف - ق - ك - ل - م - ن - هـ - و - ي

The Arabic language is classified into three forms: Classical Arabic (CA), Colloquial Arabic Dialects (CAD), and Modern Standard Arabic (MSA). CA is fully vowelized and includes classical historical liturgical texts and old literature texts. CAD includes predominantly spoken vernaculars, and each Arab country has its own dialect. MSA is the official language and includes news, media, and official documents [3]. The direction of writing in the Arabic language is from right to left.

The Arabic language has two genders, feminine (مؤنث) and masculine (مذكر); three numbers, singular (مفرد), dual (مثنى), and plural (جمع); and three grammatical cases, nominative (الرفع), accusative (النصب), and genitive (الجر). In general, Arabic words are categorized as particles (ادوات), nouns (اسماء), or verbs (افعال). Nouns in Arabic, including adjectives (صفات) and adverbs (ظروف), can be derived from other nouns, verbs, or particles. Nouns in the Arabic language cover proper nouns (such as people, places, things, ideas, day and month names, etc.). A noun takes the nominative case when it is the subject (فاعل), the accusative case when it is the object of a verb (مفعول), and the genitive case when it is the object of a preposition (مجرور بحرف جر) [20]. Verbs in Arabic are divided into perfect (صيغة الفعل التام), imperfect (صيغة الفعل الناقص), and imperative (صيغة الامر). The Arabic particle category includes pronouns (الضمائر), adjectives (الصفات), adverbs (الاحوال), conjunctions (العطف), prepositions (حروف الجر), interjections (صيغ التعجب), and interrogatives (علامات الاستفهام) [21].

Moreover, diacritics are used in the Arabic language; these are symbols placed above or below the letters to add distinct pronunciation, grammatical formulation, and sometimes another meaning to the whole word. Arabic diacritics include hamza (ء), shada ( ّ ), dama ( ُ ), fathah ( َ ), kasra ( ِ ), sukon ( ْ ), double dama ( ٌ ), double fathah ( ً ), and double kasra ( ٍ ) [22]. For instance, Table 2 presents different pronunciations of the letter (Sad) (ص):

Table 2. Different pronunciations of the letter (Sad) (ص).

صْ /s/ | صٌ /sun/ | صٍ /sin/ | صً /san/ | صَّ /ssa/ | صُّ /ssu/ | صّ /ss/ | صِ /si/ | صَ /sa/ | صُ /su/


Arabic is a challenging language in comparison with other languages such as English for a number of reasons:

• The Arabic language has a rich and complex morphology in comparison with English. Its richness is attributed to the fact that one root can generate several hundreds of words having different meanings. Table 3 presents different morphological forms of the root study (درس).

Table 3. Different morphological forms of the word study (درس).

Word | Tense | Pluralities | Meaning | Gender
درس | Past | Single | He studied | Masculine
درست | Past | Single | She studied | Feminine
يدرس | Present | Single | He studies | Masculine
تدرس | Present | Single | She studies | Feminine
درسا | Past | Dual | They studied | Masculine
درستا | Past | Dual | They studied | Feminine
يدرسان | Present | Dual | They study | Masculine
تدرسان | Present | Dual | They study | Feminine
يدرسا | Present | Dual | They study | Masculine
تدرسا | Present | Dual | They study | Feminine
درسوا | Past | Plural | They studied | Masculine
درسن | Past | Plural | They studied | Feminine
يدرسوا | Present | Plural | They study | Masculine
تدرسن | Present | Plural | They study | Feminine
سيدرس | Future | Single | He will study | Masculine
ستدرس | Future | Single | She will study | Feminine
سيدرسا | Future | Dual | They will study | Masculine
ستدرسا | Future | Dual | They will study | Feminine
سيدرسون | Future | Plural | They will study | Masculine
ستدرسون | Future | Plural | They will study | Feminine

• In English, prefixes and suffixes are added to the beginning or end of the root to create new words. In Arabic, in addition to prefixes and suffixes, there are infixes that can be added inside the word to create new words that have the same meaning. For example, in English, the word write is the root of the word writer. In Arabic, the word writer (كاتب) is derived from the root write (كتب) by adding the letter Alef (ا) inside the root. In these cases, it is difficult to distinguish between infix letters and root letters.
• Some Arabic words have different meanings based on their appearance in the context, especially when diacritics are not used; the proper meaning of the Arabic word can then only be determined from the context. For instance, the word (علم) could be Science (عِلْم), Teach (عَلَّمَ), or Flag (عَلَم), depending on the diacritics [23].
• Another challenge of automatic Arabic text processing is that proper nouns in Arabic do not start with a capital letter as in English, and Arabic letters do not have lower and upper case, which makes identifying proper names, acronyms, and abbreviations difficult.
• There are several free benchmarking English datasets used for document categorization, such as 20 Newsgroups, which contains around 20,000 documents distributed almost evenly into 20 classes; Reuters 21,578, which contains 21,578 documents belonging to 17 classes; and RCV1 (Reuters Corpus Volume 1), which contains 806,791 documents classified into four main classes. Unfortunately, there is no free benchmarking dataset for Arabic document classification. In this work, to overcome this issue, we have used an in-house dataset collected from several published papers on Arabic document classification and from scanning well-known and reputable Arabic websites.
• In the Arabic language, the problems of synonyms and broken plural forms are widespread. Examples of synonyms in Arabic are (تقدم، تعال، أقبل، هلم), which mean (Come), and (منزل، دار، بيت، سكن), which mean (house). The problem of broken plural forms occurs when an irregular noun in the Arabic language takes a plural morphological form different from its singular form. In the Arabic language, one word may have more than one lexical category (noun, verb, adjective, etc.) in different contexts, such as (wellspring, "عين الماء"), (Eye, "عين الانسان"), and (was appointed, "عُين وزير للتجارة"). In addition to the different forms of the Arabic word that result from the derivational process, some words lack authentic Arabic roots, such as Arabized words translated from other languages, e.g., (programs, "برامج"), (geography, "جغرافية"), (internet, "الإنترنت"), etc., or the names of places, such as (countries, "البلدان"), (cities, "المدن"), (rivers, "الأنهار"), (mountains, "الجبال"), (deserts, "الصحارى"), etc.

As a result, the difficulty of Arabic language processing in Arabic document categorization is associated with the complex nature of the Arabic language, which has a very rich and complicated morphology. Therefore, the Arabic language needs a set of preprocessing routines to become suitable for classification.

4. Arabic Document Categorization Framework

An Arabic document categorization framework usually consists of three main stages: preprocessing, document modeling, and document classification. The preprocessing stage involves document conversion, tokenization, stop word removal, normalization, and stemming. The document modeling stage includes vector space model construction, term selection, and term weighting. The document classification stage covers classification model construction and classification model evaluation. These stages are described in detail in the following subsections.

4.1. Data Preprocessing

Document preprocessing, which is the first step in DC, converts the Arabic documents to a form that is suitable for classification tasks. The preprocessing tasks rely on a few linguistic tools, such as tokenization, normalization, stop word removal, and stemming. These linguistic tools are used to reduce the ambiguity of words and so increase the accuracy and effectiveness of the classification system.

4.1.1. Text Tokenization

Tokenization is a method for dividing texts into tokens. Words are often separated from each other by blanks (white space, semicolons, commas, quotes, and periods). These tokens could be individual words (noun, verb, pronoun, article, conjunction, preposition, punctuation, numbers, and alphanumeric) that are extracted without understanding their meanings or relationships. The list of tokens becomes the input for further processing.

4.1.2. Stop Word Removal

Stop word removal involves the elimination of insignificant words that appear in sentences without carrying any meaning or indication of the content, such as so (لذا), for, and with (مع). Other examples of these insignificant words are articles, conjunctions, pronouns (such as he (هو), she (هي), and they (هم)), prepositions (such as from (من), to (الى), in (في), and about (حول)), demonstratives (such as this (هذا), these (هؤلاء), and those (أولئك)), and interrogatives (such as where (أين), when (متى), and whom (لمن)). Moreover, Arabic circumstantial nouns indicating time and place (such as after (بعد), above (فوق), and beside (جانب)), signal words (such as first (أولاً), second (ثانياً), and third (ثالثاً)), as well as numbers and symbols (such as @, #, &, %, and *), are considered insignificant and can be eliminated. A list of 896 words was prepared to be eliminated from all the documents.
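For illustration, a minimal Python sketch of tokenization and stop-word filtering follows; the stop-word set below is a tiny sample for demonstration, not the 896-word list used in this study:

```python
import re

# Tiny illustrative sample; the study uses a prepared list of 896 stop words.
STOP_WORDS = {"هو", "هي", "هم", "من", "الى", "في", "حول", "هذا", "هؤلاء", "أولئك"}

def tokenize(text):
    # Keep maximal runs of Arabic letters; blanks and punctuation act as separators.
    return re.findall(r"[\u0621-\u064A]+", text)

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("هذا النص في التصنيف")
print(remove_stop_words(tokens))  # ['النص', 'التصنيف']
```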

4.1.3. Word Normalization

Normalization aims to reduce certain letters that have different forms in the same word to one form. For example, "ء" (hamza), "آ" (aleph mad), "أ" (aleph with hamza on top), "ؤ" (hamza on waw), "إ" (alef with hamza at the bottom), and "ئ" (hamza on ya) are normalized to "ا" (alef). Another example is the normalization of the letter "ى" to "ي" and the letter "ة" to "ه". We also remove the diacritics such as { ً , ٌ , ٍ , َ , ُ , ِ , ْ } because these diacritics are not used in extracting the Arabic roots and are not useful in the classification. Finally, we duplicate the letters that carry the symbol " ّ " (الشدة, shadda), because these letters are used to extract the Arabic roots and removing them affects the meaning of the words.

4.1.4. Stemming

Stemming can be defined as the process of removing all affixes (such as prefixes, infixes, and suffixes) from words. Stemming reduces the different forms of a word that reflect the same meaning in the feature space to a single form (its root or stem). For example, the words (teachers, "المعلمون"), (teacher, "المعلم"), (teacher (feminine), "معلمه"), (learner, "متعلم"), and (scientist, "عالم") are derived from the same root (science, "علم") or the same stem (teacher, "معلم"). All these words share the same abstract meaning. Using stemming techniques in DC makes the processes less dependent on particular forms of words and reduces the potential size of the features, which, in turn, improves the performance of the classifier. For the Arabic language, the most common stemming approaches are the root-based stemming approach and the light stemming approach.

The root-based stemmer uses morphological patterns to extract the root. Several root-based stemmer algorithms have been developed for the Arabic language. For example, the author in [24] developed an algorithm that starts by removing suffixes, prefixes, and infixes. Next, this algorithm matches the remaining word against a list of patterns of the same length to extract the root. The extracted root is then matched against a list of known "valid" roots. However, the root-based stemming technique increases word ambiguity because several words with different meanings stem from the same root. Hence, these words will always be stemmed to this root, which, in turn, leads to poor performance. For example, the words مقصود, قاصد, and الاقتصادية have different meanings, but they are stemmed to one root, that is, "قصد"; this root is far more abstract than the stems and will lead to a very poor performance of the system [25].

The light stemming approach is the process of stripping off the most frequent suffixes and prefixes depending on a predefined list of prefixes and suffixes. The light stemming approach does not aim to extract the root of a given Arabic word; hence, this approach does not deal with infixes and does not recognize patterns. Several light stemmers have been proposed for the Arabic language, such as [26,27]. However, light stemming has difficulty in stripping off prefixes or suffixes in Arabic. In some cases, removal of a fixed set of prefixes and suffixes without checking whether the remaining word is a stem can lead to unexpected results, especially when distinguishing extra letters from root letters is difficult. Following [26], which compared the performance of the light stemmer and the root-based stemmer and concluded that the light stemmer significantly outperforms the root-based stemmer, we adopted the light stemmer [27] as the preprocessing task for Arabic document categorization in this study.

Preprocessing Algorithm:
Step 1: Remove non-Arabic letters such as punctuation, symbols, and numbers, including {., :, /, !, §, &, _, [, (, −, |, ^, ), ], }, =, +, $, *, ...}.
Step 2: Remove all non-Arabic words and any Arabic word that contains special characters.
Step 3: Remove all Arabic stop words.
Step 4: Remove the diacritics of the words, such as { ً , ٌ , ٍ , َ , ُ , ِ , ْ }, except the shadda " ّ " (الشدة).
Step 5: Duplicate all the letters that carry the shadda " ّ " (الشدة).
Step 6: Normalize certain letters in the word that have many forms to one form. For example, normalize "ء" (hamza), "آ" (aleph mad), "أ" (aleph with hamza on top), "ؤ" (hamza on waw), "إ" (alef with hamza at the bottom), and "ئ" (hamza on ya) to "ا" (alef); also normalize the letter "ى" to "ي" and the letter "ة" to "ه" when these letters appear at the end of a word.
Step 7: Remove words with length less than three letters because these words are considered insignificant and will not affect the classification accuracy.
Step 8: If the word length = 3, then return the word without stemming, because attempting to shorten words of only three letters can lead to ambiguous stems.
Step 9: Apply the light stemming algorithm to the Arabic word list to obtain an Arabic stemmed word list.
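The steps above can be sketched in Python as follows; this is a simplified illustration, and the prefix and suffix lists are small samples rather than the full lists of the adopted light stemmer [27]:

```python
import re

SHADDA = "\u0651"
# Tanween, short vowels, and sukun; the shadda (U+0651) is deliberately excluded.
DIACRITICS = re.compile(r"[\u064B-\u0650\u0652]")
PREFIXES = ("وال", "بال", "كال", "فال", "ال")  # illustrative sample, longest first
SUFFIXES = ("ون", "ات", "ان", "ين", "ها")      # illustrative sample

def preprocess(word):
    # Step 5: duplicate any letter carrying a shadda, then drop the mark itself.
    word = re.sub("(.)" + SHADDA, r"\1\1", word)
    # Step 4: remove the remaining diacritics.
    word = DIACRITICS.sub("", word)
    # Step 6: normalize letter variants to one form.
    word = re.sub("[أإآؤئء]", "ا", word)
    word = re.sub("ى$", "ي", word)
    word = re.sub("ة$", "ه", word)
    # Step 7: discard very short words; Step 8: do not stem 3-letter words.
    if len(word) < 3:
        return None
    if len(word) == 3:
        return word
    # Step 9: simplified light stemming, stripping at most one prefix and one suffix.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(preprocess("المعلمون"))  # -> 'معلم'
```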

4.2. Document Modeling

This process is also called document indexing and consists of the following two phases.

4.2.1. Vector Space Model Construction

In the VSM, a document is represented as a vector, and each dimension corresponds to a separate word. If a word occurs in the document, then its weighting value in the vector is non-zero. Several different methods have been used to calculate term weights; one of the most common methods is TFiDF.
4.2. Document Modeling

This process is also called document indexing and consists of the following two phases:

4.2.1. Vector Space Model Construction

In VSM, a document is represented as a vector. Each dimension corresponds to a separate word. If a word occurs in the document, then its weighting value in the vector is non-zero. Several different methods have been used to calculate term weights. One of the most common methods is TFiDF. The TF in the given document measures the relevance of the word within a document, whereas the DF measures the global relevance of the word within a collection of documents [28]. In particular, considering a collection of documents D containing N documents, such that D = {d_0, . . . , d_{N−1}}, each document d_j that contains a collection of terms t will be represented as a vector in VSM as follows:

d_j = (t_{1j}, t_{2j}, . . . , t_{ij}), i = 1, . . . , m,  (1)
where m is the number of distinct words in the document d_j. The TFiDF method uses the TF and DF to compute the weight of a word in a document by using Equations (2) and (3):

IDF(t) = log(N / DF(t))  (2)

TFiDF(t, d) = TF(t, d) × IDF(t)  (3)

where TF(t, d) is the number of times term t occurs in document d, DF(t) is the number of documents that contain term t, and N is the total number of documents in the training set.
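To make Equations (2) and (3) concrete, the following is a minimal Java sketch of TFiDF weighting over tokenized documents; the class and method names are illustrative, not part of the authors' system.

```java
import java.util.*;

/** A minimal sketch of TFiDF weighting per Equations (2) and (3); illustrative only. */
public class TfIdf {

    /** Computes a TFiDF vector for each document, given documents as token lists. */
    public static List<Map<String, Double>> weight(List<List<String>> docs) {
        int n = docs.size();
        // Document frequency: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                df.merge(term, 1, Integer::sum);

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Term frequency: number of times each term occurs in this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) tf.merge(term, 1, Integer::sum);

            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));  // Equation (2)
                vec.put(e.getKey(), e.getValue() * idf);                 // Equation (3)
            }
            vectors.add(vec);
        }
        return vectors;
    }
}
```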

4.2.2. Feature Selection

In VSM, a large number of terms (dimensions) are irrelevant to the classification task and can be removed without affecting the classification accuracy. The mechanism that removes the irrelevant features is called feature selection. Feature selection is the process of selecting the most representative subset that contains the most relevant terms for each category in the training set based on a few criteria and of using this subset as features in DC [29].

Feature selection aims to choose the most relevant words, i.e., those that distinguish one topic or class from the other classes in the dataset. Several feature selection methods have been introduced for DC. The most frequently used methods mainly include the DF threshold [30], information gain [31], and chi-square testing (χ²) [11]. All these methods organize the features according to their importance or relevance to the category. The top-ranking features from each category are then chosen and presented to the classification algorithm. In this paper, we adopted chi-square testing (χ²), a well-known discrete data hypothesis testing method from statistics; this technique evaluates the correlation between two variables and determines whether these variables are independent or correlated [32]. The χ² value for each term t_k in a category c_i can be defined by using Equations (4) and (5) [33]:

χ²(t_k, c_i) = |Tr| · [P(t_k, c_i) · P(t̄_k, c̄_i) − P(t_k, c̄_i) · P(t̄_k, c_i)]² / [P(t_k) · P(t̄_k) · P(c_i) · P(c̄_i)]  (4)

and is estimated using

χ²(t, c) = N · (AD − CB)² / [(A + C) · (B + D) · (A + B) · (C + D)]  (5)

where A is the number of documents of category c containing the term t; B is the number of documents of other categories (not c) containing t; C is the number of documents of category c not containing the term t; D is the number of documents of other categories not containing t; and N is the total number of documents. The chi-square statistics show the relevance of each term to the category. We compute chi-square values for each term in its respective category. Finally, the most relevant terms are chosen.
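The counts A, B, C, D, and N map directly onto code. Below is a minimal Java sketch of Equation (5); selecting the top-ranking terms per category then amounts to sorting terms by this score. The names here are illustrative, not the authors' implementation.

```java
import java.util.List;
import java.util.Set;

/** A minimal sketch of chi-square feature scoring per Equation (5); illustrative only. */
public class ChiSquareSelector {

    /**
     * docs.get(i) is the term set of document i; labels.get(i) is its category.
     * Returns the chi-square score of term t for category c.
     */
    public static double chiSquare(List<Set<String>> docs, List<String> labels,
                                   String t, String c) {
        int a = 0, b = 0, cNot = 0, d = 0;  // A, B, C, D as defined in the text
        for (int i = 0; i < docs.size(); i++) {
            boolean hasTerm = docs.get(i).contains(t);
            boolean inClass = labels.get(i).equals(c);
            if (hasTerm && inClass) a++;
            else if (hasTerm) b++;
            else if (inClass) cNot++;
            else d++;
        }
        double n = docs.size();
        double num = n * Math.pow(a * d - cNot * b, 2);
        double den = (double) (a + cNot) * (b + d) * (a + b) * (cNot + d);
        return den == 0 ? 0.0 : num / den;  // Equation (5)
    }
}
```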

4.2.3. Document Categorization

In this step, the most popular statistical classification and machine learning techniques, namely, NB [34,35], KNN [36,37], and SVM [11,38], are adopted to study the influence of preprocessing on the Arabic DC system. The VSM that contains the selected features and their corresponding weights in each document of the training dataset is used to train the classification model. The classification model obtained from the training process is then evaluated by means of the testing data.

5. Experiment and Results

5.1. Arabic Data Collection

Unfortunately, there is no free benchmarking dataset for Arabic document categorization. For most Arabic document categorization research, authors collect their own datasets, mostly from online news sites. To test the effect of preprocessing on Arabic DC and to evaluate the effectiveness of the proposed preprocessing algorithm, we used an in-house corpus collected from the datasets used in several published papers for Arabic DC and gathered by scanning well-known and reputable Arabic news websites. The collected corpus contains 32,620 documents divided into 10 categories of News, Economy, Health, History, Sport, Religion, Social, Nutriment, Law, and Technology that vary in length and number of documents. In this Arabic corpus, each document must be assigned to one of the corresponding category directories. The statistics of the corpus are shown in Table 4. Figure 1 shows the distribution of documents in each category in our corpus. The largest category contains around 6860 documents, whereas the smallest category contains nearly 530 documents.

Table 4. Statistics of the documents in the corpus.

Category Name    Number of Documents
News             6860
Economy          4780
Health           2590
History          3230
Sport            3950
Religion         3470
Social           3600
Nutriment        2370
Law              1240
Technology       530
Total            32,620


Figure 1. Distribution of documents in each category.

5.2. Experimental Configuration and Performance Measure

In this paper, the documents in each category were first preprocessed by converting them to UTF-8 encoding. For stop word removal, we used a file containing 896 stop words. This list includes distinct stop words and their possible variations.

Feature selection was performed using chi-square statistics. The numbers of features for building the vectors representing the documents were 50, 100, 500, 1000, and 2000. In doing so, the effect of each preprocessing task can be comparatively observed within a wide range of feature sizes. Feature vectors were built using the TFiDF function as described by Equations (2) and (3).

In this study, the SVM, NB, and KNN techniques were applied to observe the effect of preprocessing on improving classification accuracy. Cross-validation was used for all classification experiments, which partitioned the complete collection of documents into 10 mutually exclusive subsets called folds. Each fold contains 3262 documents. One of the subsets is used as the test set, whereas the rest of the subsets are used as training sets.

We developed our application using the Java programming language to implement the preprocessing tasks on the dataset. We used RapidMiner 6.0 (Boston, MA, USA) to build the classification model that is evaluated by means of the testing data. RapidMiner is an open-source software package which provides an implementation of all classification algorithms used in our experiments.
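The fold construction described above can be sketched as follows; this is an illustrative Java fragment, not the RapidMiner configuration actually used, and the shuffling seed is an arbitrary choice.

```java
import java.util.*;

/** A minimal sketch of the 10-fold cross-validation split; illustrative only. */
public class CrossValidation {

    /** Partitions document indices into k mutually exclusive folds. */
    public static List<List<Integer>> makeFolds(int numDocs, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < numDocs; i++) folds.get(i % k).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        // 32,620 documents in 10 folds gives 3,262 documents per fold, as in the paper.
        List<List<Integer>> folds = makeFolds(32620, 10, 42L);
        for (int f = 0; f < folds.size(); f++) {
            List<Integer> test = folds.get(f);  // one fold is the test set;
            // the remaining nine folds form the training set.
            System.out.println("Fold " + f + ": test size = " + test.size());
        }
    }
}
```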

The evaluation of the performance of the classification model in assigning documents to the correct category is conducted by using several metrics such as recall, precision, and F-measure, which are defined as follows:

Precision = TP / (TP + FP)  (6)

Recall = TP / (TP + FN)  (7)

where TP is the number of documents that are correctly assigned to the category, TN is the number of documents that are correctly assigned to the negative category, FP is the number of documents a system incorrectly assigned to the category, and FN is the number of documents that belonged to the category but are not assigned to it. The success measure, namely, the micro-F1 score, a well-known F1 measure, is selected for this study and is calculated as follows:

Micro-F1 = 2 × Precision × Recall / (Precision + Recall)  (8)

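A minimal Java sketch of Equations (6)–(8) is given below. The paper does not spell out the pooling step, so this sketch assumes the conventional micro-averaging, in which TP, FP, and FN are summed over all categories before precision and recall are computed.

```java
/** A minimal sketch of Equations (6)-(8) from per-category confusion counts; illustrative only. */
public class Evaluation {

    /** Micro-averaging pools TP, FP, FN over all categories before computing F1 (assumed). */
    public static double microF1(long[] tp, long[] fp, long[] fn) {
        long tpSum = 0, fpSum = 0, fnSum = 0;
        for (int c = 0; c < tp.length; c++) {
            tpSum += tp[c];
            fpSum += fp[c];
            fnSum += fn[c];
        }
        double precision = (double) tpSum / (tpSum + fpSum);   // Equation (6)
        double recall = (double) tpSum / (tpSum + fnSum);      // Equation (7)
        return 2 * precision * recall / (precision + recall);  // Equation (8)
    }
}
```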
5.3. Experimental Results and Analysis

First, we ran three different experiments to study the effect of each of the preprocessing tasks. The three experiments were conducted using four different representations of the same dataset. The original dataset without any preprocessing task was used in the first experiment, and this experiment served as the baseline. In the second experiment, normalization was used as the preprocessing task. Stop word removal was used in the third experiment. In the fourth and last experiment, we used the light stemmer [27]. The results of the experiments for each preprocessing task when applied individually with the three classification algorithms are illustrated in Table 5 and Figure 2.

Table 5. F1 measure scores for preprocessing tasks when applied individually.

Classification Algorithm    Feature Size    Baseline (BS)    Normalization (NR)    Stop Word Removal (SR)    Light Stemmer (LS)
NB                          50              0.65             0.6678                0.6511                    0.7311
NB                          100             0.7737           0.7763                0.7881                    0.8107
NB                          500             0.8841           0.8852                0.8896                    0.8807
NB                          1000            0.9089           0.9093                0.9130                    0.8889
NB                          2000            0.9252           0.9287                0.9319                    0.9022
KNN                         50              0.6211           0.6541                0.6319                    0.7681
KNN                         100             0.7319           0.7552                0.7411                    0.8196
KNN                         500             0.7611           0.7807                0.7741                    0.8507
KNN                         1000            0.7600           0.7674                0.7230                    0.8378
KNN                         2000            0.6359           0.6663                0.6085                    0.8119
SVM                         50              0.7074           0.7219                0.7141                    0.8019
SVM                         100             0.7781           0.8063                0.7915                    0.8767
SVM                         500             0.9311           0.9330                0.9348                    0.9507
SVM                         1000            0.9559           0.9530                0.9585                    0.9630
SVM                         2000            0.9648           0.9619                0.9657                    0.9663

Figure 2. Experimental results for each preprocessing task using the NB, KNN, and SVM classifiers.

According to the proposed method, applying normalization, stop word removal, and light stemming has a positive impact on classification accuracy in general. As shown in Table 5, applying light stemming alone has a dominant impact and provided a significant improvement in classification accuracy with the SVM and KNN classifiers, but has a negative impact when used with the NB classifier as the feature size increases. Similarly, stop word removal provided a significant improvement in classification accuracy with the NB and SVM classifiers, but has a negative impact when used with the KNN classifier as the feature size increases. The latter conclusion is surprising because most studies on DC in the literature apply stop word removal directly by assuming stop words to be irrelevant. The results showed that normalization helped to improve the performance and provided a slight improvement in the classification accuracy. Therefore, normalization should be applied regardless of feature size and classification algorithm. Given that normalization helps in grouping the words that share the same meaning, a smaller number of features with further discrimination is achieved. However, the stop word removal and stemming statuses may change depending on the feature size and the classification algorithm.

Combining Preprocessing Tasks

In the following experiments, we studied the effect of combining preprocessing tasks on the accuracy of Arabic document categorization. All possible combinations of the preprocessing tasks listed in Table 6 are considered during the experiments to reveal all possible interactions between the preprocessing tasks. Normalization, stop word removal, and light stemming (abbreviated in Table 6 as NR, SR, and LS, respectively) are either 1 or 0, which refer to "apply" or "not apply".

Table 6. Combinations of preprocessing tasks.

No.    Preprocessing Tasks Combinations
1      Normalization (NR): 1 | Stop-word removal (SR): 1 | Stemming (LS): 0
2      Normalization (NR): 1 | Stop-word removal (SR): 0 | Stemming (LS): 1
3      Normalization (NR): 0 | Stop-word removal (SR): 1 | Stemming (LS): 1
4      Normalization (NR): 1 | Stop-word removal (SR): 1 | Stemming (LS): 1

In these experiments, three classification algorithms, namely, SVM, NB, and KNN, were applied to find the most suitable combination of preprocessing tasks and classification approaches to deal with Arabic documents. The results of the experiments with the three classification algorithms are illustrated in Table 7 and Figure 3. The minimum and maximum micro-F1 values and the corresponding combinations of preprocessing tasks at different feature sizes are also included.


Figure 3. Experimental results and the corresponding combinations of preprocessing tasks for the NB, KNN, and SVM classifiers.

Table 7. Experimental results and the corresponding combinations of preprocessing tasks.

Classifier    Feature Size    Max vs. Min    Micro-F1    NR    SR    LS
NB            50              Min            0.6730      1     1     0
NB            50              Max            0.7326      1     0     1
NB            100             Min            0.7804      1     1     0
NB            100             Max            0.8141      0     1     1
NB            500             Min            0.8793      1     1     1
NB            500             Max            0.8859      1     1     0
NB            1000            Min            0.8896      0     1     1
NB            1000            Max            0.9063      1     1     0
NB            2000            Min            0.9022      1     0     1
NB            2000            Max            0.9252      1     1     0
KNN           50              Min            0.6526      1     1     0
KNN           50              Max            0.7648      1     0     1
KNN           100             Min            0.7593      1     1     0
KNN           100             Max            0.8278      0     1     1
KNN           500             Min            0.7874      1     1     0
KNN           500             Max            0.8537      1     0     1
KNN           1000            Min            0.7007      1     1     0
KNN           1000            Max            0.8467      1     1     1
KNN           2000            Min            0.5937      1     1     0
KNN           2000            Max            0.8185      1     1     1
SVM           50              Min            0.7222      1     1     0
SVM           50              Max            0.8030      1     0     1
SVM           100             Min            0.8152      1     1     0
SVM           100             Max            0.8756      1     0     1
SVM           500             Min            0.9404      1     1     0
SVM           500             Max            0.9522      1     0     1
SVM           1000            Min            0.9589      1     1     0
SVM           1000            Max            0.9633      1     0     1
SVM           2000            Min            0.9626      1     1     0
SVM           2000            Max            0.9674      1     0     1

Considering the three classification algorithms, the difference between the maximum and minimum micro-F1 for all preprocessing combinations at each feature size ranged from 0.0044 to 0.2248. In particular, the difference was between 0.0066 and 0.0596 for the NB classifier, between 0.0663 and 0.2248 for the KNN classifier, and between 0.0044 and 0.0808 for the SVM classifier. This amount of variation in accuracies proves that appropriate combinations of preprocessing tasks, depending on the classification algorithm and feature size, may significantly improve the accuracy. On the contrary, inappropriate combinations of preprocessing tasks may significantly reduce the accuracy.

The impact of preprocessing on Arabic document categorization was also statistically analyzed using a one-way analysis of variance (ANOVA) over the maximum and minimum micro-F1 at each feature size with an alpha value of 0.05. The p-values were obtained as 0.001258, 0.004316, and 0.203711 for NB, SVM, and KNN, respectively. These results supported our hypothesis that the performance differences are statistically significant at the 0.05 level for the NB and SVM classifiers.

We also compared the preprocessing combinations which provided the maximum accuracy for each classification technique and the preprocessing combinations which provided the maximum accuracy at the minimum feature size for each classification technique. As shown in Table 8, normalization and stemming should be applied to achieve either the maximum accuracy or the minimum feature size with the maximum accuracy for the KNN and SVM classifiers, regardless of the feature size. On the contrary, the statuses of the preprocessing tasks were opposite for the NB classifier.
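The ANOVA test itself can be reproduced with a standard statistics library. The sketch below uses the OneWayAnova class from Apache Commons Math, which is an assumption about tooling: the paper does not state how the test was run, and since the exact grouping procedure is not fully specified, the printed p-value need not match the reported one. The example groups are the NB maximum and minimum micro-F1 values from Table 7.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.commons.math3.stat.inference.OneWayAnova;

/** A minimal sketch of a one-way ANOVA over micro-F1 scores; illustrative only. */
public class AnovaCheck {

    public static void main(String[] args) {
        // Example groups: maximum and minimum micro-F1 at each feature size
        // (NB results from Table 7, used purely for illustration).
        double[] maxScores = {0.7326, 0.8141, 0.8859, 0.9063, 0.9252};
        double[] minScores = {0.6730, 0.7804, 0.8793, 0.8896, 0.9022};

        List<double[]> groups = Arrays.asList(maxScores, minScores);
        double pValue = new OneWayAnova().anovaPValue(groups);
        // Reject the null hypothesis of equal means when p < alpha = 0.05.
        System.out.printf("p-value = %.6f, significant = %b%n", pValue, pValue < 0.05);
    }
}
```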

Table 8. Preprocessing tasks (maximum accuracy vs. minimum feature size).

Classification Technique    Max. Accuracy (Feature Size / Max. Micro-F1 / Preprocessing Tasks)    Min. Feature Size (Feature Size / Max. Micro-F1 / Preprocessing Tasks)
Naïve Bayes                 2000 / 0.9319 / NR: 0 | SR: 1 | LS: 0                                  50 / 0.7326 / NR: 1 | SR: 0 | LS: 1
KNN                         500 / 0.8537 / NR: 1 | SR: 0 | LS: 1                                   50 / 0.7648 / NR: 1 | SR: 0 | LS: 1
SVM                         2000 / 0.9674 / NR: 1 | SR: 0 | LS: 1                                  50 / 0.8030 / NR: 1 | SR: 0 | LS: 1

Furthermore, the findings of the proposed study in terms of contribution to the accuracy were compared against the previous works mentioned in the related works section. The comparison is presented in Table 9, where the "+" and "−" signs represent positive and negative effects, respectively, and "N/F" indicates that the corresponding analysis was not found.

Table 9. The comparison of the influence of the preprocessing tasks in various studies and the proposed study.

Study                         NR     SR           LS
Uysal et al. [7]              N/F    +            + (often)
Pomikálek et al. [6]          N/F    N/F          +
Méndez et al. [8]             N/F    +            −
Chirawichitchai et al. [9]    N/F    +            N/F
Song et al. [4]               N/F    N/F          +
Toman et al. [5]              N/F    N/F          +
Mesleh [10,11]                N/F    N/F          −
Duwairi et al. [15]           N/F    N/F          +
Kanan [14]                    N/F    −            +
Zaki et al. [18]              N/F    −            −
Al-Shargabi et al. [12]       N/F    +            N/F
Khorsheed et al. [16]         N/F    N/F          N/F
Ababneh et al. [17]           N/F    N/F          −
Proposed Study                +      − (often)    + (often)

In general, the results clearly showed the superiority of SVM over the NB and KNN algorithms in all experiments, especially as the feature size increases. Similarly, Hmeidi [39] compared the performance of the KNN and SVM algorithms for Arabic document categorization and concluded that SVM outperformed KNN, showing a better micro-averaged F1 and prediction time. The SVM algorithm is superior because this technique has the ability to handle high-dimensional data. Furthermore, this


algorithm can easily test the effect of the number of features on classification accuracy, which ensures robustness to different datasets and preprocessing tasks. By contrast, the KNN performance degrades when the feature size increases. This degradation occurs because certain features do not clearly belong to either of the categories taking part in the classification process.

6. Conclusions

In this study, we investigated the impact of preprocessing tasks on the accuracy of Arabic document categorization by using three different classification algorithms. Three preprocessing tasks were used, namely, normalization, stop word removal, and stemming. The examination was conducted using all possible combinations of the preprocessing tasks, considering different aspects such as accuracy and dimension reduction. The results obtained from the experiments reveal that appropriate combinations of preprocessing tasks provide a significant improvement in classification accuracy, whereas inappropriate combinations may degrade the classification accuracy. The findings of this study showed that the normalization task helped to improve the performance and provided a slight improvement in the classification accuracy regardless of feature size and classification algorithm. Stop word removal had a negative impact when combined with other preprocessing tasks in most cases. Stemming techniques provided a significant improvement in classification accuracy when applied alone or combined with other preprocessing tasks in most cases. The results also clearly showed the superiority of SVM over the NB and KNN algorithms in all experiments. As a result, the impact of preprocessing techniques on classification accuracy is as important as the impact of feature extraction, feature selection, and classification techniques, especially for a highly inflected language such as Arabic. Therefore, for document categorization problems using any classification technique and at any feature size, researchers should consider investigating all possible combinations of the preprocessing tasks instead of applying them simultaneously or applying them individually. Otherwise, classification performance may significantly differ.

Future research will consider introducing a new hybrid algorithm for the Arabic stemming technique, which attempts to overcome the weaknesses of state-of-the-art stemming techniques in order to improve the accuracy of the DC system.

Author Contributions: Abdullah Ayedh proposed the idea of this research work, analyzed the experimental results, and wrote the manuscript. Guanzheng Tan led and facilitated the further discussion and the revision. Hamdi Rajeh and Khaled Alwesabi were involved in discussing and helping to shape the idea, and drafting the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Al-Kabi, M.; Al-Shawakfa, E.; Alsmadi, I. The Effect of Stemming on Arabic Text Classification: An Empirical Study. Inf. Retr. Methods Multidiscip. Appl. 2013. [CrossRef]
2. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features; Springer: Berlin, Germany, 1998.
3. Nehar, A.; Ziadi, D.; Cherroun, H.; Guellouma, Y. An efficient stemming for Arabic text classification. Innov. Inf. Technol. 2012. [CrossRef]
4. Song, F.; Liu, S.; Yang, J. A comparative study on text representation schemes in text categorization. Pattern Anal. Appl. 2005, 8, 199–209. [CrossRef]
5. Toman, M.; Tesar, R.; Jezek, K. Influence of word normalization on text classification. Proc. InSciT 2006, 4, 354–358.
6. Pomikálek, J.; Rehurek, R. The influence of preprocessing parameters on text categorization. Int. J. Appl. Sci. Eng. Technol. 2007, 1, 430–434.
7. Uysal, A.K.; Gunal, S. The impact of preprocessing on text classification. Inf. Proc. Manag. 2014, 50, 104–112. [CrossRef]
8. Méndez, J.R.; Iglesias, E.L.; Fdez-Riverola, F.; Diaz, F.; Corchado, J.M. Tokenising, Stemming and Stopword Removal on Anti-Spam Filtering Domain. In Current Topics in Artificial Intelligence; Springer: Berlin, Germany, 2005; pp. 449–458.
9. Chirawichitchai, N.; Sa-nguansat, P.; Meesad, P. Developing an Effective Thai Document Categorization Framework Based on Term Relevance Frequency Weighting. In Proceedings of the 2010 8th International Conference on ICT, Bangkok, Thailand, 24–25 November 2010.
10. Moh'd Mesleh, A. Support Vector Machines Based Arabic Language Text Classification System: Feature Selection Comparative Study. In Advances in Computer and Information Sciences and Engineering; Springer: Berlin, Germany, 2008; pp. 11–16.
11. Moh'd Mesleh, A. Chi square feature extraction based SVMs Arabic language text categorization system. J. Comput. Sci. 2007, 3, 430–435.
12. Al-Shargabi, B.; Olayah, F.; Romimah, W.A. An experimental study for the effect of stop words elimination for Arabic text classification algorithms. Int. J. Inf. Technol. Web Eng. 2011, 6, 68–75. [CrossRef]
13. Al-Shammari, E.T.; Lin, J. Towards an Error-Free Arabic Stemming. In Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching, Napa Valley, CA, USA, 26–30 October 2008.
14. Kanan, T.; Fox, E.A. Automated Arabic Text Classification with P-Stemmer, Machine Learning, and a Tailored News Article Taxonomy. J. Assoc. Inf. Sci. Technol. 2016. [CrossRef]
15. Duwairi, R.; Al-Refai, M.N.; Khasawneh, N. Feature reduction techniques for Arabic text categorization. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 2347–2352. [CrossRef]
16. Khorsheed, M.S.; Al-Thubaity, A.O. Comparative evaluation of text classification techniques using a large diverse Arabic dataset. Lang. Resour. Eval. 2013, 47, 513–538. [CrossRef]
17. Ababneh, J.; Almomani, O.; Hadi, W.; Al-omari, N.; Al-ibrahim, A. Vector space models to classify Arabic text. Int. J. Comput. Trends Technol. 2014, 7, 219–223. [CrossRef]
18. Zaki, T.; Es-saady, Y.; Mammass, D.; Ennaji, A.; Nicolas, S. A Hybrid Method N-Grams-TFIDF with radial basis for indexing and classification of Arabic documents. Int. J. Softw. Eng. Its Appl. 2014, 8, 127–144.
19. Thabtah, F.; Gharaibeh, O.; Al-Zubaidy, R. Arabic text mining using rule based classification. J. Inf. Knowl. Manag. 2012, 11. [CrossRef]
20. Zrigui, M.; Ayadi, R.; Mars, M.; Maraoui, M. Arabic Text Classification framework based on latent Dirichlet allocation. J. Comput. Inf. Technol. 2012, 20, 125–140. [CrossRef]
21. Khoja, S. APT: Arabic Part-of-Speech Tagger. In Proceedings of the Student Workshop at NAACL, Pittsburgh, PA, USA, 2–7 June 2001; pp. 20–25.
22. Duwairi, R.M. Arabic Text Categorization. Int. Arab J. Inf. Technol. 2007, 4, 125–132.
23. Nwesri, A.F.; Tahaghoghi, S.M.; Scholer, F. Capturing Out-of-Vocabulary Words in Arabic Text. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia, 22–23 July 2006; pp. 258–266.
24. Khoja, S.; Garside, R. Stemming Arabic Text; Computing Department, Lancaster University: Lancaster, UK, 1999. Available online: http://www.comp.lancs.ac.uk/computing/users/khoja/stemmer.ps (accessed on 14 April 2004).
25. Kanaan, G.; Al-Shalabi, R.; Ababneh, M.; Al-Nobani, A. Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve Search Effectiveness. In Proceedings of the 2008 International Conference on Innovations in Information Technology, Al Ain, United Arab Emirates, 16–18 December 2008; pp. 312–316.
26. Aljlayl, M.; Frieder, O. On Arabic Search: Improving the Retrieval Effectiveness via a Light Stemming Approach. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, McLean, VA, USA, 4–9 November 2002; pp. 340–347.
27. Larkey, L.S.; Ballesteros, L.; Connell, M.E. Light Stemming for Arabic Information Retrieval. In Arabic Computational Morphology; Springer: Berlin, Germany, 2007; pp. 221–243.
28. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Proc. Manag. 1988, 24, 513–523. [CrossRef]
29. Forman, G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 2003, 3, 1289–1305.
30. Zahran, B.M.; Kanaan, G. Text Feature Selection using Particle Swarm Optimization Algorithm. World Appl. Sci. J. 2009, 7, 69–74.
31. Ogura, H.; Amano, H.; Kondo, M. Feature selection with a measure of deviations from Poisson in text categorization. Expert Syst. Appl. 2009, 36, 6826–6832. [CrossRef]
32. Thabtah, F.; Eljinini, M.; Zamzeer, M.; Hadi, W. Naïve Bayesian Based on Chi Square to Categorize Arabic Data. In Proceedings of the 11th International Business Information Management Association (IBIMA) Conference on Innovation and Knowledge Management in Twin Track Economies, Cairo, Egypt, 4–6 January 2009; pp. 4–6.
33. Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. 2002, 34, 1–47. [CrossRef]
34. El Kourdi, M.; Bensaid, A.; Rachidi, T.-E. Automatic Arabic document categorization based on the Naïve Bayes algorithm. In Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Geneva, Switzerland, 28 August 2004; pp. 51–58.
35. Al-Saleem, S. Associative classification to categorize Arabic data sets. Int. J. ACM Jordan 2010, 1, 118–127.
36. Syiam, M.M.; Fayed, Z.T.; Habib, M.B. An intelligent system for Arabic text categorization. Int. J. Intell. Comput. Inf. Sci. 2006, 6, 1–19.
37. Bawaneh, M.J.; Alkoffash, M.S.; Al Rabea, A. Arabic Text Classification Using K-NN and Naive Bayes. J. Comput. Sci. 2008, 4, 600–605. [CrossRef]
38. Alaa, E. A comparative study on Arabic text classification. Egypt. Comput. Sci. J. 2008, 2. [CrossRef]
39. Hmeidi, I.; Hawashin, B.; El-Qawasmeh, E. Performance of KNN and SVM classifiers on full word Arabic articles. Adv. Eng. Inform. 2008, 22, 106–111. [CrossRef]

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
