Journal of Applied Computer Science & Mathematics, no. 18 (8) /2014, Suceava
Arabic Text Categorization Using Improved k-Nearest Neighbour Algorithm

1Wail Hamood KHALED, 2Haytham Saleem AL-SARRAYRIH, 3Lars KNIPPING

1,2Saudi Electronic University, Saudi Arabia
3Berlin Institute of Technology, Faculty of Mathematics and Natural Sciences, Germany

[email protected], [email protected], [email protected]
Abstract–The quantity of text information published in Arabic on the net requires the implementation of effective techniques for extracting and classifying the relevant information contained in large corpora of texts. In this paper we present an implementation of an enhanced k-NN Arabic text classifier. We apply the traditional k-NN and naive Bayes classifiers from the Weka toolkit for comparison purposes. Our proposed modified k-NN algorithm features an improved decision rule that skips the classes that are less similar and identifies the right class from the k nearest neighbours, which increases the accuracy. The study evaluates the improved decision rule using the standard measures of recall, precision and f-measure as the basis of comparison. We conclude that the effectiveness of the proposed classifier is promising and that it outperforms the classical k-NN classifier.

Keywords: Improved k-NN, k-Nearest Neighbours, KNN, Text classification.
I. INTRODUCTION

With the increasing growth of online information, text classification has become one of the key techniques for organizing text data. Much research has been conducted on classifying English documents and documents in other languages, such as the European languages (German, Italian, French, and Spanish) and Asian languages (Chinese and Japanese); compared to these languages, little research has been done on classifying Arabic documents [1, 2].

The traditional k-NN text classification algorithm has some problems. It is considered a time-consuming technique because it requires examining all training documents to classify a test document. It also requires traversing all keywords in a given document, so it is not practical for some applications such as dynamic web mining in large repositories. Another problem is that there is no weight difference between the training documents [3, 4]. Applying the traditional decision rule of the k-NN algorithm to a corpus in which a few classes contain most of the documents while others have very few documents may mislead the result, because this selection mechanism depends on the majority class among the k nearest neighbours [4].
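To make this weakness concrete, the following minimal Python sketch (our own illustration, not code from the paper; the class labels and similarity values are invented) shows the plain majority vote of the traditional decision rule, which ignores how similar each neighbour actually is:

```python
from collections import Counter

def traditional_knn_vote(neighbours):
    """neighbours: list of (class_label, similarity) pairs for the k
    nearest training documents. The similarity values are ignored,
    which is exactly the 'no weight difference' issue described above."""
    votes = Counter(label for label, _ in neighbours)
    return votes.most_common(1)[0][0]

# Three weakly similar 'sports' neighbours outvote two highly
# similar 'economy' neighbours:
print(traditional_knn_vote([("sports", 0.10), ("sports", 0.12),
                            ("sports", 0.11), ("economy", 0.90),
                            ("economy", 0.85)]))   # -> sports
```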
A) Research Background

Text classification goes back to the beginning of the sixties. It became a subfield of information systems in the early nineties due to the increased interest in text classification and the improvements in technology [5]. In the early eighties, most research was aimed at finding new methods to store, represent and retrieve relevant information from small numbers of documents. Several factors hampered progress in classification, notably hardware limitations and the construction cost of classification systems [6]. Thus, most approaches to text classification were rule-based: they depended on human experts who defined the classification rules manually [5].

In the early nineties, the Internet became accessible to nearly everyone, hardware capabilities developed, and the number of systems requiring text classification increased. Researchers in the field of language processing increasingly worked on finding practical text classification methods to access and retrieve relevant documents [6]. Most research in data mining focuses on structured data stored in databases. However, the increasing amount of unstructured data stored in text form, gained from different sources, requires organizing these documents into classes to facilitate the retrieval and extraction of information of interest [1]. Recent information systems contain a huge number of documents, which makes manual classification a complicated and difficult task, as it requires a human classifier to have full knowledge of all document topics that the system deals with [6]. Rule-based approaches have therefore started to fade in favour of machine learning approaches in which categorization is automated [2, 6].

Automatic text classification is defined as classifying unlabelled documents into predefined categories based on their contents [7]. Another definition describes text classification as identifying to which predefined class a testing document belongs [8]. Here, predefined classes refer to the classes used to train the classifiers, and testing documents refer to the documents to be classified [9, 10]. Text classification techniques have been used in a broad range of applications such as spam filtering, mail routing, news monitoring and so on [2, 11]. The algorithms utilized in text classification are numerous, for example artificial neural networks, support vector machines, naive Bayes, k-nearest neighbour, and decision trees [2]. K-nearest neighbour is one of the most important algorithms for text classification [2, 6] and one of the machine learning techniques that use predefined data to train the classifier [7].
B) The Arabic language

Most work in text classification has been done on the English language, and only little research has been done on the Arabic language, due to several factors. Probably the main one is the complexity of the Arabic language: its highly inflectional and derivational nature makes morphological analysis a difficult task. Arabic is one of the Semitic languages. Unlike English, it is written from right to left. It has 28 letters, in addition to Hamza "ء"; the vowels of the language are "ا", "و" and "ي", and the remaining letters are consonants [2, 12]. Some Arabic letters appear differently in words depending on their position in the word (initial, medial, final) and on whether they are connected to another letter. For example, the letter "ص" is written as "ـصـ" when it appears in the middle of a word (such as "مصدر", meaning source), as "صـ" at the beginning of a word (as in hunter, "صياد"), and as "ـص" at the end of a word when connected to the preceding letter (as in dancing, "رقص"). The same letter appears in its isolated form "ص" when it is at the end of a word and disconnected from the previous letter (as in chances, "فرص") [12].

II. RELATED WORK

Much recent research has tried to solve the problems of the classical k-NN algorithm for English text, while only a few studies address the Arabic language. In [13] the authors investigated the choice of the parameter k, as a fixed k value results in a bias towards large categories. They used different k values for different categories instead of a fixed k value across all categories. An experiment on Chinese text showed that this approach performs better than the traditional k-NN algorithm. In [14] the authors propose the Weight Adjusted k-Nearest Neighbour (WAKNN) classification algorithm. WAKNN uses a weighted term frequency of keywords as similarity measure, adds up the similarity for each class over the k nearest neighbours and selects the class with the highest value. The idea here is to determine the features that are important and to let these features contribute to the similarity function. The results showed that WAKNN outperformed other methods such as C4.5, RIPPER, Rainbow, PEBLS, and VSM [14]. Another enhanced k-NN algorithm, based on improvements in indexing, is presented in [2]. It suggests a k-NN Arabic text classifier using word-level N-grams (unigrams, i.e. single words, and bigrams, i.e. phrases) for indexing. Compared with a k-NN classifier based on single terms (bag of words), which assumes that the terms in the text are mutually independent, the N-gram-based k-NN classifier gives better performance. The improved k-NN algorithm in our paper differs from the work in [2] and [14]: it gives preference to the classes with a larger number of documents and at the same time ignores the classes that are far away, identifying the right class from the k nearest neighbours.
A) Document pre-processing phase

Pre-processing aims to convert the documents into a form that is better suited for the classification algorithm. For Arabic text documents we use the following steps (illustrated, together with the similarity computation, in the code sketch below):
• The documents are converted from different formats such as HTML, XML, or SGML to a plain text format in UTF-8 Unicode.
• Tokenization: the document's text is divided into a set of tokens (words), recognized using spaces and punctuation marks as word separators.
• All digits, numbers, hyphens, and punctuation marks are removed [6].
• Normalization of some Arabic letters as follows: "ء" (Hamza), "أ" (Hamza on top of Alef), "إ" (Hamza under Alef), "آ" (Mad on top of Alef), "ؤ" (Hamza on Waw) and "ئ" (Hamza on Ya) are normalized to the letter "ا" (Alef). The goal of this normalization is to standardize the Alef. The letters "ة" and "ي" are normalized to "ه" and "ى", respectively [1, 6].
• All non-Arabic parts of the texts, like English words, are removed.
• Arabic function words are removed. The Arabic function words are words that are not useful for classification, e.g. Arabic prefixes, pronouns, and prepositions. Filtering can also be applied, such as removing infrequent items [1].

After pre-processing, each document is represented as a vector of word frequencies.

B) An enhanced k-NN algorithm

Here, term frequencies are normalized within a document so that the sum of all term frequencies equals 1. The cosine similarity measure is used to find the k nearest neighbours. It considers only the keywords that exist in both the testing and the training document. As an example, suppose a test document Q has the keywords {T1, T2, T3} and a training document D1 has the keywords {T1, T3, T4}. In this case the cosine similarity uses only the keywords that appear in both Q and D1, which are {T1, T3}.
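The following minimal Python sketch (our own illustration, not the authors' implementation; the stop-word handling and the exact letter set are simplifying assumptions drawn from the steps above) builds normalized term-frequency vectors and computes the shared-keyword cosine similarity:

```python
import re
from collections import Counter

HAMZA_FORMS = "ءأإآؤئ"   # all normalized to plain Alef, as listed above

def preprocess(text, stop_words=frozenset()):
    """Pre-processing sketch: normalize letters, tokenize, and keep
    only Arabic tokens that are not function words."""
    for ch in HAMZA_FORMS:
        text = text.replace(ch, "ا")
    text = text.replace("ة", "ه").replace("ي", "ى")   # per the text above
    tokens = re.findall(r"[\u0621-\u064A]+", text)    # Arabic letters only;
    return [t for t in tokens if t not in stop_words]  # drops digits, Latin

def tf_vector(tokens):
    """Term frequencies normalized within the document so they sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def cosine_similarity(q, d):
    """Cosine similarity; the numerator uses only keywords present in
    BOTH documents, as in the Q/D1 example above."""
    shared = q.keys() & d.keys()
    numerator = sum(q[t] * d[t] for t in shared)
    norm_q = sum(v * v for v in q.values()) ** 0.5
    norm_d = sum(v * v for v in d.values()) ** 0.5
    return numerator / (norm_q * norm_d) if norm_q and norm_d else 0.0
```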
Implementing the k-buffer technique, the algorithm does not have to fully evaluate each training document. Instead of considering each keyword in each document, it detects the k nearest neighbours faster, as it fully evaluates only those training documents that have a high similarity to the query document. In the pre-processing step, the term frequencies of all terms are calculated. For the k-buffer approach, the similarity measures of the first k documents in the training set are stored, together with the document name/id, in an array of size k. Then, for each subsequent document in the training corpus, the algorithm checks whether the numerator of the cosine similarity for that document is greater than the lowest similarity value in the buffer array. Only in this case is the full similarity measure computed; otherwise, the training document is simply skipped, making the algorithm more efficient.
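A sketch of the k-buffer loop under the same assumptions, reusing tf_vector and cosine_similarity from the listing above. This is not the published code, and improved_decision_rule is only one plausible reading of the "skip the less similar classes" description (similarity-weighted class scores); the paper's exact formula is not reproduced in this section:

```python
def k_nearest_with_buffer(query_tf, training, k):
    """k-buffer sketch: training is a list of (tf_vector, label) pairs.
    Returns the buffer of (similarity, label) pairs, best first."""
    buffer = []
    for doc_tf, label in training:
        if len(buffer) < k:                 # fill the buffer with the first k docs
            buffer.append((cosine_similarity(query_tf, doc_tf), label))
            buffer.sort(reverse=True)
            continue
        shared = query_tf.keys() & doc_tf.keys()
        numerator = sum(query_tf[t] * doc_tf[t] for t in shared)
        # The paper's pruning test: only when the cheap numerator beats the
        # lowest buffered similarity is the full cosine computed.
        if numerator > buffer[-1][0]:
            sim = cosine_similarity(query_tf, doc_tf)
            if sim > buffer[-1][0]:
                buffer[-1] = (sim, label)   # evict the weakest neighbour
                buffer.sort(reverse=True)
    return buffer

def improved_decision_rule(neighbours):
    """Score each class by the summed similarity of its neighbours, so
    distant neighbours barely contribute and less similar classes are
    effectively skipped (our hedged reading of the improved rule)."""
    scores = {}
    for sim, label in neighbours:
        scores[label] = scores.get(label, 0.0) + sim
    return max(scores, key=scores.get)
```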
C) Experiments and Evaluation

As shown in Fig. 4.1, the improved k-NN classifier outperforms the traditional k-NN classifier for both tested values of k (k = 17 and k = 50), and a minor further improvement in accuracy can be observed for the enhanced classifier at the smaller k. The naive Bayes classifier shows the best accuracy, followed by the improved k-NN classifier (with k = 17); the traditional k-NN classifier comes last, as shown in Fig. 4.2. The accuracy of the improved k-NN classifier (0.833) is close to that of the naive Bayes classifier (0.89), and both clearly outperform the traditional k-NN classifier (0.524). Fig. 4.3 shows the precision, recall and f-measure of the three classifiers on the full corpus: the naive Bayes classifier achieves the best values on all three measures, and the improved k-NN classifier outperforms the traditional k-NN. There is a noticeable disproportion between these measures for the traditional k-NN, whose precision reaches 0.757 while its f-measure is only 0.486, whereas the measures are close to each other (stable) for the other two classifiers, naive Bayes and improved k-NN.

Fig. 4.1 Accuracy results for the traditional and improved k-NN (k = 17 and k = 50)

Fig. 4.2 Accuracy results for the traditional k-NN, naive Bayes and improved k-NN classifiers

Fig. 4.3 Precision, recall and f-measure for all three classifiers

III. CONCLUSIONS

An improved k-nearest neighbour classifier was implemented for Arabic text and improves performance in terms of accuracy. The experimental results indicate that the enhanced k-NN classifier presented in this paper is applicable to the Arabic language. The naive Bayes classifier achieved the best accuracy, the improved k-NN classifier came second, and the worst classifier for this data set was the traditional k-NN classifier.

REFERENCES
[1] Al-Harbi S., Almuhareb A., Al-Thubaity A., Khorsheed M., Al-Rajeh A., "Automatic Arabic Text Classification," JADT'2008 (en ligne), pp. 77-83, 2008.
[2] Al-Shalabi R., Obeidat R., "Improving KNN Arabic Text Classification with N-Grams Based Document Indexing," in INFOS2008: Proceedings of the Sixth International Conference on Informatics and Systems, 2008.
[3] Suguna N., Thanushkodi K., "An Improved k-Nearest Neighbor Classification Using Genetic Algorithm," International Journal of Computer Science Issues, Vol. 7, No. 2, pp. 18-21, 2010.
[4] Miah M., "Improved k-NN Algorithm for Text Classification," International Conference on Data Mining (DMIN), 2009.
[5] Sebastiani F., "Machine Learning in Automated Text Categorization," ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47, 2002.
[6] Bawaneh M., Alkoffash M., Al Rabea A., "Arabic Text Classification using K-NN and Naive Bayes," Journal of Computer Science, Vol. 4, No. 7, pp. 600-605, 2008.
[7] Ikonomakis M., Kotsiantis S., Tampakas V., "Text Classification Using Machine Learning Techniques," WSEAS Transactions on Computers, Vol. 4, pp. 966-974, 2005.
[8] Schütze H., Velipasaoglu E., Pedersen J., "Performance Thresholding in Practical Text Classification," in CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 662-671, 2006.
[9] Hadi W., Salam M., Al-Widian, "Performance of NB and SVM Classifiers in Islamic Arabic Data," in ISWSA '10: Proceedings of the ACM 1st International Conference on Intelligent Semantic Web-Services and Applications, 2010.
[10] Thabtah F., Eljinini M., Zamzeer M., Hadi W., "Naïve Bayesian Based on Chi Square to Categorize Arabic Data," in IBIMA: Proceedings of the 11th International Business Information Management Association Conference on Innovation and Knowledge Management in Twin Track Economies, Vol. 10, pp. 930-935, 2009.
[11] Jbara K., "Knowledge Discovery in Al-Hadith Using Text Classification Algorithm," Journal of American Science, Vol. 6, No. 11, pp. 409-419, 2010.
[12] Duwairi R., "Arabic Text Categorization," The International Arab Journal of Information Technology, Vol. 4, No. 2, pp. 125-132, 2007.
[13] Baoli L., Shiwen Y., "An Adaptive k-Nearest Neighbor Text Categorization Strategy," ACM Transactions on Asian Language Information Processing (TALIP), Vol. 3, No. 4, pp. 215-226, 2004.
[14] Han E., Karypis G., Kumar V., "Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification," in PAKDD '01: Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 53-65, 2001.
[15] El Kourdi M., Bensaid A., Rachidi T., "Automatic Arabic Document Categorization Based on the Naïve Bayes Algorithm," in Semitic '04: Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, pp. 51-58, 2004.
[16] Barakat A., "Two-Sample Multivariate Test of Homogeneity Using Weighted Nearest Neighbours," Master thesis, Computational Mathematics, An-Najah National University, 2012.
[17] Singh S., Murthy H., Gonsalves T., "Feature Selection for Text Classification Based on Gini Coefficient of Inequality," in Proceedings of the Fourth Workshop on Feature Selection in Data Mining, Vol. 10, pp. 76-85, 2010.
[18] Mesleh A., "Chi Square Feature Extraction Based SVMs Arabic Language Text Categorization System," Journal of Computer Science, Vol. 3, No. 6, pp. 430-435, 2007.
Haytham Al-Sarrayrih received his BSc in computer science in 1991 from Yarmouk University, Jordan. He received his MSc degree in Management Information Systems in 2003 and his PhD degree in Computer Information Systems in 2009, both from the Arab Academy for Banking and Financial Sciences, Jordan. In 2010 he spent ten months as a postdoc at TU Berlin, Germany. His research interests are in text mining, clustering, e-learning, and natural language processing. He worked at Mutah University for 16 years in different positions, starting as an analyst/programmer and later serving as director of the computer center until 2010. In 2011 he worked as an Assistant Professor at the Faculty of Information Technology, Al-Hussein Bin Talal University, Maan, Jordan. He has been working at the Saudi Electronic University in Saudi Arabia since 2012.

Lars Knipping is a professor of "New Media in Mathematics and Natural Sciences" and heads the e-learning group of the innoCampus centre at Technische Universität Berlin. He also acts as Scientific Director of IBI gGmbH (institute for education in the information society). He belongs to the board of editors of ITSE (International Journal of Interactive Technology and Smart Education) and the editorial team of iJET (International Journal of Emerging Technologies in Learning). Dr. Knipping is a member of the DIN-NIA 36 expert group that cooperates with ISO SC-36 in creating e-learning standards (among them ISO/IEC 19796). Before joining Technische Universität Berlin he worked as a scientific consultant in a research project for the state-funded TV broadcaster Sender Freies Berlin, followed by positions as researcher and instructor in the multimedia group of the computer science department at Freie Universität Berlin and as lecturer in International Media and Computing at the FHTW Berlin. Dr. Knipping received his Ph.D. degree in 2005, graduating with distinction, for his work on the EChalk system, and holds M.Sc. degrees in both mathematics and computer science.

Wail H. Khaled received his BSc and MSc degrees in computer information systems from Yarmouk University, Jordan, in 2008 and 2012, respectively. In 2013 he joined the Saudi Electronic University, where he is currently working as a lecturer in the preparatory year. His research interests include knowledge-based systems, data mining, and text categorization.