A Simple Study of Webpage Text Classification Algorithms for Arabic and English Languages

Sumaia Mohammed AL-Ghuribi and Saleh Alshomrani
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
[email protected], [email protected]

Abstract— Webpage text classification is an important problem that has been studied through different approaches and algorithms. It aims to assign a predefined category to a Webpage based on its content and linguistic features. It has many applications such as word sense disambiguation, document indexing, text filtering, hierarchical categorization of Webpages and document organization. This study is part of a work in progress in which we aim to develop a bi-language algorithm for classifying Arabic and English Webpage text that performs accurately and efficiently in both languages. It provides a simple overview of many approaches constructed for classifying Arabic and English Webpage documents. In this survey, the widely used algorithms for text classification are described, together with a comparison of recent research on classification for Arabic and English, in order to conclude which algorithm can best be applied to both languages.

Keywords— Webpage text classification, linguistic features, Arabic language, classification algorithms.

I. INTRODUCTION

Nowadays we notice an explosive growth of the Internet, with billions of Webpages on every topic easily accessible through the Web. This rapid growth of Webpages has led to the use of data mining and its applications to organize Web documents and make them more useful for users. One of these applications is text classification (TC). Text classification is an old research area that gained more interest in the 1990s because of the increasing amount of textual information and the improvement of computer technologies, which enabled more sophisticated algorithms to be implemented [19]. Text classification is a Natural Language Processing (NLP) problem; it can be defined as the assignment of an unclassified document to one or more predefined categories based on its content [19]. It is also referred to as text categorization, document categorization, document classification or topic spotting [3]. TC techniques are used in many applications such as document filtering, spam filtering, e-mail filtering, news monitoring, mail routing, indexing, document organization and retrieval, and opinion mining.

Generally, the TC process goes through three phases: data pre-processing, text classification and evaluation. The data pre-processing phase contains two tasks: document representation and feature selection. The main ways of representing a document are the bag-of-words model, in which a document is represented as the set of its words together with their associated frequencies in the document, and the string representation, in which each document is treated directly as a sequence of words [30]. Most text classification methods use the bag-of-words representation because of its simplicity for classification purposes [30]. Feature selection achieves two main goals [13].

First, it makes training a classifier more efficient by decreasing the high dimensionality of the effective vocabulary. Second, feature selection often increases classification accuracy by removing rare terms. In the next step of classification [19], the documents are split into training and testing sets; the training documents are used to train the system (i.e., to learn the different patterns of the categories), while the testing documents are used to evaluate it. The categorization process depends on the algorithm used and is assessed with evaluation measures such as recall, precision and the F1-measure. Many algorithms are used as classifiers, but the most widely used [42, 27, 30] are the Decision Tree classifier, the Support Vector Machine classifier, the Naïve Bayes classifier, the Neural Network classifier and the K-Nearest Neighbors classifier.

Most TC research is designed and tested for English; some TC approaches have been carried out for other languages such as German, Italian, Spanish, Chinese and Japanese. However, little TC work has been carried out for the Arabic language [3], due to its complex and rich nature, which poses many challenges to anyone interested in creating an Arabic classifier. Some of the characteristics of the Arabic language are as follows. Arabic is a very rich language with complex morphology. It belongs to the family of Semitic languages and differs from Latin languages morphologically, syntactically and semantically. The Arabic writing system has 25 consonants and three long vowels that are written from right to left and change shape according to their position in the word. Additionally, Arabic has short vowels (diacritics), such as fatha, kasra and damma, written above or below a consonant to give it its desired sound and hence give the word its desired meaning. The Arabic language consists of three types of words: nouns, verbs and particles. Nouns and verbs are derived from a limited set of about 10,000 roots [43]. Furthermore, a stem may accept prefixes and/or suffixes to form a word [44]. Arabic is therefore highly derivative, where tens or even hundreds of words can be formed using only one root; moreover, a single word may be derived from multiple roots [45].

In this study, the main algorithms for classifying Arabic and English Webpage documents are discussed, and a comparison of recent research in classification using these or other algorithms is presented to find which algorithm is best suited for a bi-language classifier. The rest of this paper is organized as follows: Section II presents five main algorithms for classifying Webpage documents; Section III gives a comparison of recent research in the classification field; finally, we conclude this study in Section IV.
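Before turning to the individual algorithms, the following minimal sketch illustrates the three-phase process described above (bag-of-words representation, chi-square feature selection, a classifier, and evaluation with precision, recall and F1-measure) using scikit-learn. The toy Webpage texts, the category labels and the choice of a naïve Bayes classifier are illustrative assumptions only, not part of any system surveyed here.

```python
# A minimal text-classification pipeline: bag-of-words representation,
# chi-square feature selection, a classifier, and precision/recall/F1 evaluation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# Invented toy corpus: each Webpage text is paired with a predefined category.
train_docs = [
    "latest football match results and league standings",
    "team wins the championship after a late goal",
    "stock market closes higher as banks report profits",
    "central bank raises interest rates to curb inflation",
    "new smartphone released with faster processor",
    "software update improves battery life and security",
]
train_labels = ["sport", "sport", "economy", "economy", "technology", "technology"]

test_docs = ["the striker scored two goals in the final",
             "investors react to the quarterly earnings report"]
test_labels = ["sport", "economy"]

pipeline = Pipeline([
    ("bow", CountVectorizer()),              # document representation: bag of words
    ("select", SelectKBest(chi2, k=10)),     # feature selection: keep the 10 best terms
    ("clf", MultinomialNB()),                # classification phase
])

pipeline.fit(train_docs, train_labels)       # training documents teach the system
predicted = pipeline.predict(test_docs)      # testing documents evaluate it

# Evaluation phase: recall, precision and F1-measure per category.
print(classification_report(test_labels, predicted, zero_division=0))
```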


II. WEBPAGE TEXT CLASSIFICATION ALGORITHMS

A. Decision Tree Classifier

Decision trees are among the most powerful and widely used inductive learning methods and are one of the data mining induction approaches most commonly applied to the classification process. Their robustness to noisy data and their capability to learn disjunctive expressions make them suitable for document classification [3]. They are designed as a hierarchical division of the underlying data space based on different text features [30]. Construction proceeds in two phases: tree building (in a top-down manner) and tree pruning (in a bottom-up manner). A decision tree method takes data described by its features as input and recursively partitions the records, using a breadth-first or depth-first greedy approach, until all data items have been assigned to a particular class. The output is a decision tree that assigns a suitable class to the input data. Some of the most well-known decision tree algorithms are [24]: ID3 (Iterative Dichotomiser 3), introduced by Ross Quinlan in 1986; C4.5, an improvement of ID3 developed by Quinlan (1993); CART (Classification and Regression Trees), introduced by Breiman et al. (1984); SLIQ (Supervised Learning In Quest), introduced by Mehta et al. (1996); and SPRINT (Scalable Parallelizable Induction of Decision Trees), introduced by Shafer et al. (1996).
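As a small illustration of the idea, the sketch below trains a decision tree on term counts with scikit-learn, whose implementation is CART-style rather than ID3 or C4.5; the toy documents, labels and depth limit are assumptions made for the example.

```python
# Sketch of a decision-tree text classifier (CART-style, as implemented in scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

docs = [
    "goal scored in the last minute of the match",
    "the league table after this weekend games",
    "shares fell sharply after the profit warning",
    "the bank announced a new savings account",
]
labels = ["sport", "sport", "economy", "economy"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # features = term counts

# Tree building is top-down; pruning is emulated here by limiting the depth.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, labels)

# Show the learned hierarchical splits over text features, then classify a new text.
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))
print(tree.predict(vectorizer.transform(["the match ended in a draw"])))
```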

B. Support Vector Machine Classifier

Support Vector Machines (SVMs) are a class of machine learning techniques first introduced by Vapnik [32] and first applied to text classification by Joachims [33]. SVM is one of the most robust and successful classification algorithms [40]. The first use of SVMs for Arabic text classification was presented by Mesleh [41]. SVM is based on the principle of structural risk minimization [4]. SVM classifiers attempt to partition the data space using linear or non-linear delineations between the different classes [30]; the key in such classifiers is to determine the optimal boundaries between the different classes and use them for the purposes of classification [30]. Support Vector Machines have been applied successfully in many text classification tasks because of their principal advantages [33]: they are robust in high-dimensional spaces, where overfitting does not greatly affect the computation of the final decision margin; every feature is important, and even features that might be considered irrelevant have been found to be useful when calculating the margin; they are robust when samples are sparse; and most text categorization problems are linearly separable. Additionally, the SVM method is flexible and can easily be combined with interactive user-feedback methods [37].
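A minimal sketch of a linear SVM text classifier follows, assuming scikit-learn's LinearSVC on TF-IDF features; the toy documents and labels are invented for illustration and do not come from the works surveyed.

```python
# Sketch of a linear SVM text classifier; text categorization problems are
# typically high-dimensional and (nearly) linearly separable, which suits SVMs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = [
    "the forward scored a stunning goal",               # sport
    "the referee stopped the match after a foul",       # sport
    "inflation figures pushed the markets lower",       # economy
    "the company reported record quarterly revenue",    # economy
]
labels = ["sport", "sport", "economy", "economy"]

# TF-IDF weighting plus a linear SVM that finds a maximum-margin boundary.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)

print(model.predict(["inflation worries pushed the markets lower again"]))
```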

C. Naïve Bayes Classifier

A naïve Bayes (NB) classifier is a simple probabilistic classifier used in the classification process. It is based on applying Bayes' theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data, with strong (naïve) independence assumptions. The naïve part of the classifier comes from the simplifying assumption that all terms are conditionally independent of each other given a category. Because of this independence assumption, the parameters for each term can be learned separately, which simplifies and speeds up the computation compared to non-naïve Bayes classifiers [6]. The NB classifier attempts to build a probabilistic classifier based on modeling the underlying word features in the different classes. The idea is then to classify text based on the posterior probability of a document belonging to the different classes, given the words present in the document [30]. There are two common models for NB text classification, discussed by McCallum and Nigam [38]: the multinomial model and the multivariate Bernoulli model [6]. Both models essentially compute the posterior probability of a class based on the distribution of the words in the document [30]. Some advantages of naïve Bayes are that it requires only a small amount of training data to estimate the parameters necessary for classification, that it affords fast, highly scalable model building and scoring, and that it can be used for both binary and multiclass classification problems.
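The sketch below contrasts the two NB event models mentioned above, assuming scikit-learn's MultinomialNB (term counts) and BernoulliNB (binary term occurrence); the toy corpus and labels are illustrative assumptions.

```python
# Sketch contrasting the two common NB event models for text:
# multinomial (term frequencies) and multivariate Bernoulli (term presence/absence).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = [
    "cheap flights and hotel deals for your holiday",   # travel
    "book a beach resort with free airport transfer",   # travel
    "parliament passed the new budget bill today",      # politics
    "the minister answered questions in the debate",    # politics
]
labels = ["travel", "travel", "politics", "politics"]
query = ["the committee will debate the bill next week"]

# Multinomial model: features are word counts.
counts = CountVectorizer()
X_counts = counts.fit_transform(docs)
print(MultinomialNB().fit(X_counts, labels).predict(counts.transform(query)))

# Bernoulli model: features are binary word occurrences.
binary = CountVectorizer(binary=True)
X_binary = binary.fit_transform(docs)
print(BernoulliNB().fit(X_binary, labels).predict(binary.transform(query)))
```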

D. Neural Networks Classifier

Neural networks are considered an important tool for classification. The recent extensive research activity in neural classification has established that neural networks are a promising alternative to various conventional classification methods [39]. A neural network consists of an input layer, one or more hidden layers and an output layer [40]. The basic unit in a neural network is a neuron, or unit. The inputs to the network correspond to the attributes measured for each training tuple, which are fed into the units making up the input layer. They are then weighted and fed simultaneously to a hidden layer. The number of hidden layers is arbitrary, although usually only one is used. The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction [40]. Neural networks have many advantages, some of which are summarized in [39] as follows. First, they are data-driven, self-adaptive methods: they can adjust themselves to the data without any explicit specification of a functional or distributional form for the underlying model. Second, they are universal functional approximators: neural networks can approximate any function with arbitrary accuracy. Third, neural networks are nonlinear models, which makes them flexible in modeling complex real-world relationships. Finally, neural networks are able to estimate posterior probabilities, which provides the basis for establishing classification rules and performing statistical analysis.
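For illustration, the sketch below trains a small feed-forward network with a single hidden layer on TF-IDF features, assuming scikit-learn's MLPClassifier; the toy documents, labels and hidden-layer size are assumptions made for the example.

```python
# Sketch of a feed-forward neural network (one hidden layer) for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "new graphics card doubles rendering speed",          # technology
    "the laptop ships with a faster solid state drive",   # technology
    "voters head to the polls on sunday",                 # politics
    "the opposition criticised the proposed reform",      # politics
]
labels = ["technology", "technology", "politics", "politics"]

# Input layer = TF-IDF features, one hidden layer of 16 units, output layer = classes.
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(docs, labels)

print(model.predict(["the senate votes on the reform tomorrow"]))
print(model.predict_proba(["the senate votes on the reform tomorrow"]))  # posterior estimates
```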

E. K-Nearest Neighbors Classifier

K-nearest neighbor (k-NN) classification is a non-parametric lazy learning algorithm and a well-known statistical approach, considered among the oldest non-parametric classification algorithms [40]. It is based on the closest training examples in the feature space. The main thesis in k-NN is that documents belonging to the same class are likely to be close to one another under similarity measures such as the dot product or the cosine metric [30]. k-NN assumes that the data lie in a feature space, so that the points have a notion of distance, and that each training example consists of a feature vector together with an associated class label. A single number k is given; this number decides how many neighbors influence the classification. Although the k-NN classifier is a very simple classifier that works well on basic recognition problems, it has a set of drawbacks [6]: k-NN is a lazy, example-based learning method that has no off-line training phase, and the main computation is the on-line scoring of training documents given a test document in order to find the k nearest neighbors. This makes k-NN inefficient, because nearly all computation takes place at classification time rather than when the training examples are first encountered. Moreover, the k-NN classifier has the major drawback of requiring the value of k to be selected, and the success of classification depends heavily on this value.
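A minimal k-NN sketch follows, assuming scikit-learn's KNeighborsClassifier with a cosine metric over TF-IDF vectors; the toy documents, labels and the value k = 3 are illustrative assumptions.

```python
# Sketch of a k-NN text classifier using cosine similarity; note that almost all the
# work happens at prediction time (lazy learning), and the choice of k matters.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "the team trained hard before the derby",         # sport
    "the coach changed the lineup for the final",     # sport
    "the artist opened a new exhibition downtown",    # culture
    "the museum extends its modern art collection",   # culture
]
labels = ["sport", "sport", "culture", "culture"]

# k = 3 neighbors, distances computed with the cosine metric on TF-IDF vectors.
model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
model.fit(docs, labels)   # "training" only stores the labeled vectors

print(model.predict(["a late goal decided the final"]))
```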

III. COMPARISON OF WEBPAGE TEXT CLASSIFICATION APPROACHES

 #   Approach                         Classifier                                   Language
 1   Kwon & Lee (2000)                KNN                                          English
 2   Nigam et al. (2000)              Expectation-Maximization (EM) & NB           English
 3   Tsukada et al. (2001)            Decision tree                                English
 4   Sun et al. (2002)                SVM                                          English
 5   Selamat et al. (2003)            Neural network                               English
 6   El Kourdi et al. (2004)          NB                                           Arabic
 7   Al-Shalabi et al. (2006)         KNN                                          Arabic
 8   Chandra & Paul (2006)            Decision tree                                English
 9   El-Halees (2007)                 Maximum Entropy                              Arabic
 10  Shi et al. (2008)                Hybrid KNN & SVM                             English
 11  Zhongli & Zhijing (2008)         NB                                           English
 12  Thabtah et al. (2008)            KNN                                          Arabic
 13  Al-Shalabi & Obeidat (2008)      KNN (bag-of-words and N-gram indexing)       Arabic
 14  Al-Harbi et al. (2008)           SVM; decision tree                           Arabic
 15  Gharib et al. (2009)             SVM                                          Arabic
 16  Thabtah et al. (2009)            NB                                           Arabic
 17  Yazdani et al. (2009)            NB & an extended Hidden Markov Model         English
 18  Kamruzzaman (2010)               Artificial Neural Network                    English
 19  Zhong & Zou (2011)               SVM                                          English
 20  Alsaleem (2011)                  SVM; NB                                      Arabic
 21  Ting et al. (2011)               NB                                           English
 22  Youquan et al. (2011)            NB                                           English
 23  Chantar et al. (2011)            SVM; NB; decision tree                       Arabic
 24  Zhaohui et al. (2011)            KNN (words and links)                        English
 25  El-Halees (2011)                 KNN                                          Arabic
 26  Patil & Pawar (2012)             NB                                           English
 27  Mangai et al. (2012)             Feature intervals                            English
 28  Baraa et al. (2012)              Frequency Ratio Accumulation Method (FRAM)   Arabic
 29  AL Zaghoul & Al-Dhaheri (2013)   Artificial Neural Network                    Arabic
 30  Mangai et al. (2013)             KNN                                          English

Table 1: Comparison of Webpage text classification approaches (approach, classifier and language; for each approach, the document representation, evaluation data set, number of classes, precision, recall and F1-measure were also compared). To determine which algorithm is suitable for classifying both Arabic and English Webpage text with high accuracy, we take the precision reported for each approach and plot it, as Figure 1 shows.

[Figure 1: Accuracy of some approaches for classifying Webpage text. The y-axis shows the reported precision (65-100%) and the x-axis the approach number from Table 1; approaches compared with more than one classifier or representation appear once per variant (e.g., S_14 and D_14 for the SVM and decision-tree results of approach 14).]

From approaches 14, 20 and 23 in Table 1 we can see that the SVM classifier outperforms the NB and decision tree classifiers. Additionally, high accuracy appears clearly when SVM or KNN classifiers are used, as in approaches 24, 19, 15 and 7, where the accuracy ranges between 92% and 99%. However, it is difficult to compare these two classifiers unless they are tested on the same corpora. In [46], Hmeidi et al. made a comparative study of two machine learning methods (KNN and SVM) on Arabic text classification; the results show that SVM gives better F1-measure values and prediction times than KNN. Also, in [47] Mesleh built a text classification system for Arabic language articles using an SVM classifier and compared it with NB and KNN classifiers; the results show that SVM outperforms the other classifiers. In [48], El-Halees applied machine learning classifiers (NB, SVM and KNN) to Arabic and English corpora, and the results prove that SVM gives the best F1-measure and accuracy. Additionally, in [49] Jayousi designed a tool for comparing different classification algorithms used to classify Arabic texts, and his experiments show that SVM gives higher accuracy than KNN. The same holds for English text, where SVM outperforms KNN in Siolas's experiments [50]. As a result, we can conclude that the SVM classifier is the best choice for our bi-language algorithm.
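Since, as noted above, classifiers can only be compared fairly when tested on the same corpora, the sketch below evaluates SVM, NB and KNN classifiers on a single shared toy corpus with identical folds and macro-averaged F1 as the common measure; the corpus and parameter choices are illustrative assumptions only.

```python
# Sketch of a like-for-like comparison of SVM, NB and KNN on the same corpus,
# using identical cross-validation folds and macro-averaged F1 as the common measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = [
    "the striker scored twice in the cup final",
    "the champions lifted the trophy last night",
    "a red card changed the course of the match",
    "the goalkeeper saved a penalty in extra time",
    "shares rallied after the earnings announcement",
    "the central bank kept interest rates unchanged",
    "oil prices rose on supply concerns",
    "the retailer reported weaker holiday sales",
]
labels = ["sport"] * 4 + ["economy"] * 4

classifiers = {
    "SVM": LinearSVC(),
    "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3, metric="cosine"),
}

for name, clf in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(model, docs, labels, cv=2, scoring="f1_macro")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```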

IV. CONCLUSION AND FUTURE WORK

With the enormous increase in the number of Webpages, Webpage text classification has become an important topic for improving many applications that use Webpages as a source of knowledge. In this study, we presented the widely used algorithms for classifying Webpage text and compared the available research on classifying Arabic and English Webpage documents to determine which classifier will best support the bi-language algorithm we want to create for classifying Arabic and English Webpage text. As future work, we will try to construct a classifier that performs efficiently on Arabic and English Webpage text and gives accurate results in both languages.

REFERENCES
[1] Zhong Shaobo and Zou Dongsheng, "Web Page Classification using an Ensemble of Support Vector Machine Classifiers," Journal of Networks, vol. 6, no. 11, November 2011.
[2] Shi Xuelin, Zhao Ying and Dong Xiangjun, "Web Page Categorization Based on k-NN and SVM Hybrid Pattern Recognition Algorithm," in Proceedings of the 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '08), vol. 2, IEEE Computer Society, Washington, DC, USA, pp. 523-527, 2008.
[3] Li Y. H. and Jain A. K., "Classification of Text Documents," The Computer Journal, vol. 41, no. 8, pp. 537-546, 1998.
[4] Alsaleem Saleh, "Automated Arabic Text Categorization Using SVM and NB," International Arab Journal of e-Technology, vol. 2, no. 2, June 2011.
[5] Sun Aixin, Lim Ee-Peng and Ng Wee-Keong, "Web classification using support vector machine," in Proceedings of the 4th International Workshop on Web Information and Data Management (WIDM '02), ACM, New York, NY, USA, pp. 96-99, 2002.
[6] Gharib Tarek, Habib Mena and Fayed Zaki, "Arabic Text Classification Using Support Vector Machines," International Journal of Computers and Their Applications, 2009.
[7] Kamruzzaman S. M., "Web Page Categorization Using Artificial Neural Networks," in Proc. 4th International Conference on Electrical Engineering, The Institution of Engineers, Dhaka, Bangladesh, pp. 96-99, Jan. 2006; ArXiv e-prints, 2010.
[8] AL Zaghoul Fawaz and Al-Dhaheri Sami, "Arabic Text Classification Based on Features Reduction Using Artificial Neural Networks," UKSim 15th International Conference on Computer Modelling and Simulation, 2013.
[9] Selamat Ali, Omatu Sigeru, Yanagimoto Hidekazu, Fujinaka Toru and Yoshioka Michifumi, "Web page classification method using neural networks," Transactions of The Institute of Electrical Engineers of Japan, 123(5), pp. 1020-1026, 2003.
[10] Ting S. L., Ip W. H. and Tsang Albert H. C., "Is Naïve Bayes a Good Classifier for Document Classification," International Journal of Software Engineering and Its Applications, vol. 5, no. 3, July 2011.
[11] Zhongli He and Zhijing Liu, "A Novel Approach to Naive Bayes Web Page Automatic Classification," Fifth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '08), vol. 2, pp. 361-365, 18-20 Oct. 2008.
[12] Youquan He, Jianfang Xie and Cheng Xu, "An improved Naive Bayesian algorithm for Web page text classification," Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 3, pp. 1765-1768, 26-28 July 2011.
[13] Thabtah Fadi, Eljinini Mohammed, Zamzeer Mannam and Hadi Wael, "Naïve Bayesian based on Chi Square to Categorize Arabic Data," in Proceedings of The 11th International Business Information Management Association (IBIMA) Conference on Innovation and Knowledge Management in Twin Track Economies, Cairo, Egypt, 2009.
[14] Santra A. K. and Jayasudha S., "Classification of Web Log Data to Identify Interested Users Using Naïve Bayesian Classification," International Journal of Computer Science Issues (IJCSI), vol. 9, issue 1, p. 381, Jan. 2012.
[15] Patil Ajay S. and Pawar B. V., "Automated Classification of Web Sites using Naive Bayesian Algorithm," in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, p. 466, 2012.
[16] El Kourdi Mohamed, Bensaid Amine and Rachidi Tajje-eddine, "Automatic Arabic document categorization based on the Naïve Bayes algorithm," in Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages (Semitic '04), Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 51-58, 2004.
[17] Mangai J. Alamelu, Wagle Satej Milind and Kumar V. Santhosh, "A Novel Web Page Classification Model using an Improved k Nearest Neighbor Algorithm," International Conference on Intelligent Computational Systems (ICICS'2013), April 29-30, 2013, Singapore.
[18] Chantar Hamouda K. and Corne David W., "Feature subset selection for Arabic document categorization using BPSO-KNN," in NaBIC, IEEE, pp. 546-551, 2011.
[19] Al-Shalabi Riyad, Kanaan Ghassan and Gharaibeh Manaf, "Arabic Text Categorization Using KNN Algorithm," in Proceedings of The 4th International Multiconference on Computer Science and Information Technology, vol. 4, 2006.
[20] Thabtah Fadi, Hadi Wa'el Musa and Al-shammare Gaith, "VSMs with K-Nearest Neighbour to Categorise Arabic Text Data," in Proceedings of the World Congress on Engineering and Computer Science 2008 (WCECS 2008), October 22-24, 2008, San Francisco, USA.
[21] Al-Shalabi Riyad and Obeidat Rasha, "Improving KNN Arabic Text Classification with N-Grams Based Document Indexing," in Proceedings of the Sixth International Conference on Informatics and Systems, Cairo, Egypt, 2008.
[22] Kwon Oh-Woog and Lee Jong-Hyeok, "Web Page Classification Based on k-Nearest Neighbor Approach," in Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL '00), ACM, New York, NY, USA, pp. 9-15, 2000.
[23] Chandra B. and Paul V. Pallath, "A Robust Algorithm for Classification Using Decision Trees," IEEE Conference on Cybernetics and Intelligent Systems, pp. 1-5, 7-9 June 2006.
[24] Anyanwu Matthew N. and Shiva Sajjan G., "Comparative Analysis of Serial Decision Tree Classification Algorithms," IJCSS, vol. 3, issue 3, pp. 230-240, 2009.
[25] Nigam Kamal, McCallum Andrew, Thrun Sebastian and Mitchell Tom, "Text Classification from Labeled and Unlabeled Documents Using EM," Machine Learning, vol. 39, pp. 103-134, 2000.

[26] Mangai J. Alamelu, Kothari Dipti D. and Kumar V. Santhosh, "A Novel Approach for Automatic Web Page Classification using Feature Intervals," IJCSI International Journal of Computer Science Issues, vol. 9, issue 5, no. 2, September 2012.
[27] Baraa Sharef, Nazlia Omar and Zeyad Sharef, "An Automated Arabic Text Categorization Based on the Frequency Ratio Accumulation," International Arab Journal of Information Technology (IAJIT), 2012.
[28] Zhaohui Xu, Fuliang Yan, Jie Qin and Haifeng Zhu, "A Web Page Classification Algorithm Based on Link Information," Tenth International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES), pp. 82-86, 14-17 Oct. 2011.
[29] Tsukada Makoto, Washio Takashi and Motoda Hiroshi, "Automatic Web-page classification by using machine learning methods," in Proc. Web Intelligence: Research and Development, vol. 2198, pp. 303-313, 2001.
[30] Aggarwal Charu C. and Zhai ChengXiang, "A survey of text classification algorithms," in Mining Text Data, Springer, pp. 163-213, 2012.
[31] Yazdani Majid, Eftekhar Milad and Abolhassani Hassan, "Tree-Based Method for Classifying Websites Using Extended Hidden Markov Models," in Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD '09), Thanaruk Theeramunkong, Boonserm Kijsirikul, Nick Cercone and Tu-Bao Ho (Eds.), Springer-Verlag, Berlin, Heidelberg, pp. 780-787, 2009.
[32] Vapnik V. N., The Nature of Statistical Learning Theory, Springer Verlag, Heidelberg, DE, 1995.
[33] Joachims Thorsten, "Text categorization with support vector machines: learning with many relevant features," in Proceedings of the 10th European Conference on Machine Learning (ECML-98), Chemnitz, Germany, pp. 137-142, 1998.
[34] Al-Harbi S., Almuhareb A., Al-Thubaity A., Khorsheed M. S. and Al-Rajeh A., "Automatic Arabic Text Classification," 9es Journées internationales d'Analyse statistique des Données Textuelles (JADT'08), France, pp. 77-83, 2008.
[35] El-Halees Alaa, "Arabic Opinion Mining Using Combined Classification Approach," in Proceedings of the International Arab Conference on Information Technology (ACIT), 2011.
[36] El-Halees Alaa M., "Arabic Text Classification Using Maximum Entropy," The Islamic University Journal, vol. 15, no. 1, pp. 157-167, 2007.
[37] Raghavan H. and Allan J., "An interactive algorithm for asking and incorporating feature feedback into support vector machines," ACM SIGIR Conference, 2007.
[38] McCallum Andrew and Nigam Kamal, "A comparison of event models for naive Bayes text classification," in AAAI-98 Workshop on Learning for Text Categorization, pp. 41-48, 1998.
[39] Zhang G. P., "Neural networks for classification: a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 30, no. 4, pp. 451-462, Nov. 2000.
[40] Navadiya Darshna and Patel Roshni, "Web Content Mining Techniques - a Comprehensive Survey," International Journal of Engineering Research & Technology (IJERT), vol. 1, issue 10, December 2012.
[41] Mesleh Abdelwadood, "Support Vector Machines based Arabic Language Text Classification System: Feature Selection Comparative Study," 12th WSEAS Int. Conf. on Applied Mathematics, Cairo, Egypt, December 29-31, 2007.
[42] Sanan Majed, Rammal Mahmoud and Zreik Khaldoun, "Arabic supervised learning method using N-gram," Interactive Technology and Smart Education, vol. 5, no. 3, 2008.
[43] Darwish K., "Building a shallow Arabic morphological analyzer in one day," in Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), pp. 1-8.
[44] Darwish K., "Probabilistic methods for searching OCR-degraded Arabic text," Ph.D. Thesis, Electrical and Computer Engineering Department, University of Maryland, College Park, 2003.
[45] Ahmed Mohamed Attia, "A Large-Scale Computational Processor of the Arabic Morphology, and Applications," Master's Thesis, Faculty of Engineering, Cairo University, Cairo, Egypt, 2000.
[46] Hmeidi Ismail, Hawashin Bilal and El-Qawasmeh Eyas, "Performance of KNN and SVM classifiers on full word Arabic articles," Advanced Engineering Informatics, vol. 22, issue 1, pp. 106-111, January 2008.
[47] Mesleh A. A., "Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System," Journal of Computer Science, vol. 3, no. 6, pp. 430-435, 2007.
[48] El-Halees A., "Filtering Spam E-Mail from Mixed Arabic and English Messages: A Comparison of Machine Learning Techniques," The International Arab Journal of Information Technology, vol. 6, no. 1, pp. 52-59, 2009.
[49] Jayousi Rashid and Salhi Nabil, "Arabic Text Classification: A Unified Test Bed Environment," Al-Quds University, Computer Science Department, Jerusalem, Palestine, 2008.
[50] Siolas G. and d'Alche-Buc F., "Support Vector Machines based on a semantic kernel for text categorization," in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 5, pp. 205-209, 2000.
