Multi-Class Classification of Turkish Texts with Machine Learning Algorithms

Fatih Gürcan
Department of Computer Engineering
Karadeniz Technical University
Trabzon, Turkey

Abstract—Text classification is the supervised assignment of text documents to one or more predefined categories or classes according to their content, using natural language processing methods. Text classification applications are actively used in fields such as the categorization of social interactions, web pages, and news texts, search engine optimization, information extraction, and automatic e-mail processing. This study aims to classify Turkish texts with methods based on supervised machine learning, and analyzes the classification success of supervised learning models on Turkish texts under different parameters. The models were tested on the classification of news texts into five predefined classes (economy, politics, sport, health, and technology), and the system was trained with different numbers of training documents. The classification performances of the Multinomial Naïve Bayes, Bernoulli Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, and Decision Tree algorithms on Turkish news texts are compared and interpreted in light of the results obtained with different parameters. The algorithm with the best classification success was Multinomial Naïve Bayes, with a classification success of about 90%. These results show that the Naïve Bayes probability model can be used as an effective classifier for Turkish texts compared with the other methods. It is envisaged that the proposed methodology could be applied to Turkish texts on different web platforms (social networks, forums, communication networks, etc.) for different purposes.
Keywords—Turkish text classification, multi-class text classification, supervised learning, machine learning algorithms.

978-1-5386-4184-2/18/$31.00 ©2018 IEEE

I. INTRODUCTION

In recent years, with the advent of social networks, there has been a massive explosion of text-centric data sharing on online platforms. Processing and categorizing the textual data produced and shared in such large amounts is one of the fundamental issues of data mining [1]–[3]. The text classification problem can be defined as the assignment of text documents to one or more predefined categories or classes based on their content, using data mining methods. Text classification applications are used effectively in many areas for different purposes [2], [4]. Well-known applications include spam filtering of incoming messages, categorization of online textual data by theme, search engine optimization, categorization of information, social network analysis, sentiment analysis, opinion mining, identification of social trends, document summarization, recommendation systems, author identification, and language recognition tools [2], [4], [5].

Machine learning is a core data analysis technique that automates analytical model building. It is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers, and it is widely used in text mining applications. Machine learning algorithms are commonly divided into two categories: supervised learning and unsupervised learning. Clustering and association problems are handled with unsupervised learning algorithms, whereas classification and regression problems are handled with supervised learning algorithms [1], [5], [6].

In unsupervised learning models, data of different types such as text, images, and signals are first preprocessed. Items that are closely related to each other are then grouped into clusters that were not defined in advance [2]. In the clustering process, the similarity between the data within a cluster should be maximal, whereas the similarity between clusters should be minimal [1]–[3]. In this approach, there are no predefined classes and no training process is used [2]. Clustering algorithms such as k-means, and feature extraction techniques such as principal component analysis, singular value decomposition, and independent component analysis, are examples of unsupervised learning approaches [7].

In supervised learning models, the classes are predefined and the system is trained with a labeled training set [1], [6]. During training, the attributes of each predefined class are learned automatically by the system. Finally, the test data to be classified are assigned to the predefined classes according to what was learned, and the classification process is completed [1], [6], [8].
Text classification with machine learning algorithms is based on the supervised learning approach. The most significant step in this approach is to identify the classes precisely and accurately using training sets. Once the classes are identified, this approach is easier and more effective than the unsupervised approach [1], [6], [8]. The best-known supervised learning algorithms are Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Decision Trees, Linear Regression, and Logistic Regression [6], [8]. Supervised machine learning algorithms are used effectively in the analysis and classification of texts shared on online platforms where textual data interaction is intense, such as news sites, online shopping sites, blogs, online communities, and social networks [2], [4], [5], [9].

The majority of research and applications in textual data analysis are based on identifying the main themes and topics contained in texts and classifying the texts according to these themes. The existing literature contains many studies based on classification and regression analysis of texts in different languages [4], [5]. Most text classification studies have been conducted on English texts [2], [8], whereas studies on texts in other languages such as Turkish [10], Arabic [11], German [12], Chinese [13], and Spanish [14] are relatively limited. Because of the linguistic structure of Turkish, the size of its term space is very large compared to other languages. For this reason, Turkish texts require more extensive preprocessing. In particular, stemming is one of the most difficult steps in the classification of Turkish texts. A limited number of classification studies have been carried out on Turkish texts, and most of them use supervised machine learning approaches [10], [15]–[17].

This experimental study aims to classify Turkish news texts with the supervised machine learning algorithms Multinomial Naïve Bayes (Multinomial NB), Bernoulli Naïve Bayes (Bernoulli NB), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Decision Trees (J48). The classification success of these algorithms was analyzed with different numbers of training documents and different parameters for the five predefined classes (economy, politics, sport, health, and technology).
As a result of the study, the algorithm that achieved the best classification success on the Turkish texts was Multinomial NB, with a success rate of approximately 90%. The findings show that the Naïve Bayes probability models, comprising the Multinomial and Bernoulli approaches, can be used as effective classifiers for Turkish texts in comparison to the other supervised learning approaches. In addition, the Naïve Bayes probability models were seen to be more suitable for Turkish texts in terms of applicability. It is envisaged that the proposed methodology could be applied to diverse classification problems on Turkish texts on different web platforms (social networks, forums, blogs, etc.).

II. RESEARCH METHODOLOGY

The methodology of this study consists of four sequential stages, in accordance with the characteristics of a classification problem. First, Turkish news texts published on online news sites were collected and a data set containing about 3000 news texts was created. Next, the texts in the data set were preprocessed. The document-term matrix that characterizes the data set was then created. Finally, a text classification analysis was performed on this matrix using the five supervised machine learning algorithms. In light of the findings obtained with different input parameters, the classification performances of Multinomial NB, Bernoulli NB, SVM, KNN, and J48 were evaluated and compared.

A. Data Collection and Preprocessing

The data set used in this study consists of Turkish news texts published on online news sites. A data set of 3000 news texts was gathered from two different online news sites [18], [19]. Five distinct classes were defined for the assignment of the texts: economy, politics, sport, health, and technology. The number of news texts assigned to each class was 1000.

After the data set was created, preprocessing steps were performed on the data to improve the quality and accuracy of the analysis [15], [16]. In this stage, punctuation, numbers, and web links were deleted from the texts. Stopwords, which carry little meaning on their own (such as “you”, “with”, and “for”), were also deleted. Turkish is an agglutinative language, and agglutination refers to the process of adding suffixes to a root word. Therefore, the Snowball stemmer algorithm was applied to reduce words to their stems [20], [21]. In this way, the texts are represented in a lower-dimensional word space.

Textual data are qualitative by nature and must be converted into a numerical matrix for quantitative analysis. The processed data were therefore transformed into the document-term matrix required for the classification operations. In a document-term matrix, each row represents a news text and each column represents a word in the data set. Finally, term weighting was performed on the document-term matrix, using term frequency as the weight.

B. Method and Analysis

In this stage, an experimental analysis based on the classification of the news texts was performed on the document-term matrix created in the previous stage. The supervised machine learning algorithms Multinomial NB, Bernoulli NB, SVM, KNN, and J48 were used in this analysis.
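The preprocessing and term-frequency document-term matrix described in Section II-A can be illustrated with a minimal sketch. It uses only the Python standard library. The stopword list, the regular expressions, the function names, and the two sample sentences are assumptions made for illustration, not taken from the study. The Snowball stemming step is left out because the standard library has no Turkish stemmer.

```python
import re
from collections import Counter

# Illustrative stopword list (assumption); the paper does not list its stopwords.
STOPWORDS = {"ve", "ile", "için", "bir", "bu"}

def preprocess(text):
    """Lowercase, drop web links, punctuation and numbers, then drop stopwords."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text.lower())
    # [^\W\d_]+ matches runs of Unicode letters only, so digits and punctuation are skipped.
    tokens = re.findall(r"[^\W\d_]+", text)
    # NOTE: the stemming step used in the paper is omitted in this sketch.
    return [t for t in tokens if t not in STOPWORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the count of term j in document i."""
    tokenized = [preprocess(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency within one document
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical sample texts, not taken from the paper's corpus.
docs = [
    "Dolar ve euro bugün yükseldi, dolar 2 kez rekor kırdı.",
    "Takım maç için hazır: http://example.com/spor",
]
vocabulary, matrix = document_term_matrix(docs)
```

Each row of `matrix` corresponds to one text and each column to one vocabulary term, matching the layout described above. Python's `str.lower()` is not Turkish-aware (it maps "I" to "i" rather than "ı"), so a real pipeline would need locale-specific case handling as well as stemming.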
The most important task in the implementation of supervised learning methods is the creation of a manually pre-classified training set. The selection of the classification algorithms that achieve the highest success and the determination of the optimal parameters for these algorithms are also significant. In supervised machine learning, the system is trained with a set of manually specified class labels, and the attributes associated with each class are learned by the classifier. Manually categorized data are separated into two groups, a training set and a test set. The training set is used to teach the classifier the attributes of each predefined class. The texts in the test set are then assigned to the predefined classes according to what was learned, and the classification process is completed [5].

In this classification analysis, the classifier system was trained for the five predefined classes (economy, politics, sport, health, and technology). Of the 3000 news texts in the data set, 2000 were used for the training set and the remaining 1000 for the test set. The classifier system was first trained with a training set containing 500 news texts, and 500 texts were added to the training set each time until it reached 2000 texts. In this way, the effect of the number of texts in the training set on the classification success was measured.

When separating the training set and the test set, the texts in the training set should not be included in the test set used to measure the success. If the same texts are included in both sets, the success of the classification cannot be measured reliably [2], [5], [22]. The training and test sets are usually created randomly. For the success of the classifier system to be measured accurately, both the training set and the test set must be sufficiently representative.

In this regard, k-fold cross-validation is an extensively used technique for testing the accuracy of a statistical model that performs classification. In this method, the hand-labeled data set is separated into k groups that share no elements. In the first step, the first group is selected as the test set and the remaining k-1 groups are used as the training set. Once the classification algorithm has been trained on these k-1 groups, its performance on the test set is calculated. In the second step, the second group is selected as the test set, the remaining k-1 groups form the training set, and the performance is calculated again. After training and testing have been performed in k steps, the average of the k results is taken as the final performance value. K-fold cross-validation provides an analysis-based estimate for adjusting the training and test sets, especially in large data sets. Values of k = 10 or k = 5 are frequently used [22].

In the classification model implemented for this study, cross-validation was applied with k = 5, and the performances of Multinomial NB, Bernoulli NB, SVM, KNN, and J48 were calculated under this cross-validation.

III. EXPERIMENTAL RESULTS

As a result of this experimental analysis, the classification performances of the Multinomial NB, Bernoulli NB, SVM, KNN, and J48 algorithms on Turkish texts were calculated and compared using different parameters. In addition, k-fold cross-validation was used to test the accuracy of the classifier system in terms of both the classification algorithms and the predefined classes.

A. Performance Evaluation of the Classification Algorithms

Initially, 2000 news texts covering the five predefined classes were used for the training set and the remaining 1000 for the test set. The classifier system was first trained with 500 news texts, and 500 texts were added each time until the training set reached 2000 texts. Each algorithm was thus trained with four training set sizes (500, 1000, 1500, and 2000 documents) and evaluated for each one. The results are given in Table I, which shows the classification performance of each algorithm for each training set size together with its overall average.

According to the findings in the table, the performance of the algorithms generally increases as the number of documents in the training set increases. The only exception is Bernoulli NB, whose accuracy dips slightly at 1500 documents. In terms of classification performance, Multinomial NB is the best classifier, with an average of approximately 90%. It is followed by Bernoulli NB with about 78%. The KNN algorithm (number of nearest neighbors k = 3) had the lowest performance in this analysis.

In addition, to measure and evaluate the accuracy of the classifier system, 5-fold cross-validation was applied to the full data set of 3000 news texts (the training and test sets combined). The classification performances obtained by cross-validation are given in Table II. With 5-fold cross-validation, Multinomial NB was confirmed to be the best classifier, with a classification success of 95.20%.

TABLE I. PERFORMANCES OF THE ALGORITHMS ACCORDING TO TRAINING SETS

| Algorithms | 500 docs | 1000 docs | 1500 docs | 2000 docs | Average |
|---|---|---|---|---|---|
| Multinomial NB | 88.38% | 89.32% | 90.15% | 91.17% | 89.75% |
| Bernoulli NB | 76.10% | 77.18% | 76.36% | 80.70% | 77.59% |
| SVM | 65.12% | 66.91% | 73.29% | 77.20% | 70.63% |
| J48 | 53.63% | 57.20% | 63.33% | 68.19% | 60.59% |
| KNN | 32.94% | 34.48% | 50.75% | 57.20% | 43.84% |
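The protocol behind Table I (a fixed test set and a training set that grows step by step) might be sketched as follows with a small multinomial Naïve Bayes classifier. This is a minimal illustration, not the study's implementation. Laplace smoothing with `alpha=1.0`, ignoring terms unseen in training, the function names, and the toy tokenized documents are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed term likelihoods from token lists."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocabulary = {t for counts in term_counts.values() for t in counts}
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(term_counts[label].values()) + alpha * len(vocabulary)
        log_likelihood = {
            t: math.log((term_counts[label][t] + alpha) / total) for t in vocabulary
        }
        model[label] = (math.log(n_docs / len(labels)), log_likelihood)
    return model

def predict(model, tokens):
    """Return the class with the highest log posterior; unseen terms are ignored."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood.get(t, 0.0) for t in tokens)
    return max(model, key=score)

def accuracy(model, docs, labels):
    """Share of documents whose predicted class equals the true class."""
    hits = sum(predict(model, d) == y for d, y in zip(docs, labels))
    return hits / len(labels)

# Toy, already-tokenized data (hypothetical; the paper's corpus is not reproduced here).
train_docs = [["gol", "maç"], ["borsa", "dolar"], ["maç", "takım"], ["dolar", "faiz"]]
train_labels = ["sport", "economy", "sport", "economy"]
test_docs = [["takım", "gol"], ["faiz"]]
test_labels = ["sport", "economy"]

# Grow the training set step by step while the test set stays fixed.
results = {}
for size in (2, 4):
    model = train_multinomial_nb(train_docs[:size], train_labels[:size])
    results[size] = accuracy(model, test_docs, test_labels)
```

In this toy run the smaller training set has never seen the term "faiz", so the second test document cannot be separated by its content. Adding the remaining training documents resolves it, mirroring on a tiny scale the trend reported in the table.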

TABLE II. EVALUATION OF THE CLASSIFICATION ALGORITHMS WITH CROSS VALIDATION

| Algorithms | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Average |
|---|---|---|---|---|---|---|
| Multinomial NB | 97.55% | 94.49% | 93.98% | 95.00% | 95.00% | 95.20% |
| Bernoulli NB | 99.60% | 89.03% | 91.42% | 90.91% | 92.10% | 92.61% |
| SVM | 92.45% | 86.62% | 86.04% | 87.85% | 85.60% | 87.71% |
| J48 | 97.55% | 78.96% | 81.52% | 82.23% | 81.52% | 84.36% |
| KNN | 89.03% | 77.00% | 75.75% | 76.41% | 77.80% | 79.20% |
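The k-fold procedure summarized in Table II (k disjoint groups, each used once as the test set, scores averaged) can be sketched as below. The round-robin assignment of documents to folds is an assumption, since the paper does not state whether the folds were shuffled or stratified. The majority-label classifier is a hypothetical stand-in used only to keep the example self-contained.

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k disjoint, near-equal folds of indices."""
    folds = [[] for _ in range(k)]
    for i in range(n_items):
        folds[i % k].append(i)  # round-robin assignment (assumption)
    return folds

def cross_validate(items, labels, k, train_fn, score_fn):
    """Use each fold once as the test set and average the k scores."""
    folds = k_fold_indices(len(items), k)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(items)) if i not in held_out]
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [items[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return scores, sum(scores) / k

# Hypothetical stand-in classifier: always predicts the most frequent training label.
def train_majority(items, labels):
    return max(sorted(set(labels)), key=labels.count)

def score_majority(model, items, labels):
    return sum(y == model for y in labels) / len(labels)

labels = ["a"] * 6 + ["b"] * 4
items = list(range(10))
scores, mean_score = cross_validate(items, labels, 5, train_majority, score_majority)
```

Any classifier can be plugged in through `train_fn` and `score_fn`. The per-fold values in `scores` play the role of the Fold 1 to Fold 5 columns, and `mean_score` plays the role of the Average column.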

B. Performance Evaluation of the Predefined Classes

At this stage of the analysis, the classification performance of each algorithm was measured for each of the predefined classes using 2000 training and 1000 test documents. The classification success of each algorithm per class and the average classification success of each class are given as percentages in Table III. According to the results, the class with the highest average classification performance was sport, with a classification success of 88.56%. The classes with the lowest average performance were politics (72.02%) and economy (71.71%). Since the content of the economy and politics classes is close to each other, the classification success in these classes is lower than in the other classes.

In addition, to verify the classification performance of the categories, 5-fold cross-validation was applied to the entire data set of 3000 news texts. The resulting per-category performances are given in Table IV. Under cross-validation, the ordering of the classes by average performance is the same as in Table III, which supports the per-class results.

TABLE III. CLASSIFICATION PERFORMANCES OF THE PREDEFINED CLASSES

| Algorithms | Sport | Health | Technology | Politics | Economy |
|---|---|---|---|---|---|
| Multinomial NB | 91.94% | 97.04% | 83.00% | 93.21% | 81.59% |
| Bernoulli NB | 97.04% | 94.49% | 81.21% | 75.34% | 62.06% |
| SVM | 86.83% | 94.49% | 83.97% | 65.12% | 79.29% |
| J48 | 88.87% | 74.06% | 86.07% | 68.95% | 64.35% |
| KNN | 78.14% | 75.34% | 72.02% | 57.46% | 71.25% |
| Average | 88.56% | 87.08% | 81.25% | 72.02% | 71.71% |
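A per-class figure like those in Table III can be computed from the true and predicted labels of the test documents. The paper does not define the per-class measure precisely. The sketch below assumes it is the fraction of each class's test documents that were assigned to that class (per-class recall). The label lists are hypothetical.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of each class's test documents that were assigned to that class."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Hypothetical labels for illustration only; these are not the paper's predictions.
true_labels = ["sport", "sport", "politics", "politics", "economy", "economy"]
predicted = ["sport", "sport", "politics", "economy", "politics", "economy"]
per_class = per_class_accuracy(true_labels, predicted)
```

In this toy example the two confusions occur between politics and economy, the same pair of classes that the study reports as hardest to separate.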

TABLE IV. PERFORMANCE EVALUATION OF THE PREDEFINED CLASSES WITH CROSS VALIDATION

| Algorithms | Sport | Health | Technology | Politics | Economy |
|---|---|---|---|---|---|
| Multinomial NB | 96.14% | 99.60% | 95.34% | 96.19% | 91.08% |
| Bernoulli NB | 96.19% | 97.04% | 98.75% | 89.38% | 84.27% |
| SVM | 96.19% | 95.34% | 80.02% | 87.68% | 82.57% |
| J48 | 97.04% | 93.64% | 91.94% | 79.17% | 78.32% |
| KNN | 96.19% | 71.51% | 86.83% | 71.51% | 79.17% |
| Average | 96.35% | 91.42% | 90.57% | 84.78% | 83.08% |

IV. CONCLUSION

In this study, the classification performances of the supervised machine learning algorithms Multinomial NB, Bernoulli NB, SVM, KNN, and J48 on Turkish news texts were calculated and compared using different parameters. Moreover, k-fold cross-validation (with k = 5) was used to assess the accuracy of the classifier system in terms of both the classification algorithms and the predefined classes. Multinomial NB was the best classifier, with a classification success of approximately 90%, while KNN showed the lowest performance in this analysis. The study also indicated that the Naïve Bayes probability models, in their Multinomial and Bernoulli variants, are the two best classifiers for Turkish texts. In addition, increasing the number of documents in the training set from 500 to 2000 improved the classification performance of each algorithm implemented in this study. In this regard, the methodology and findings of the study may provide valuable contributions to text classification research on Turkish texts on different web platforms (social networks, forums, blogs, etc.).

REFERENCES

[1] F. Sebastiani, “Machine learning in automated text categorization,” ACM Comput. Surv., 2002.
[2] A. N. Srivastava and M. Sahami, Text mining: Classification, clustering, and applications. CRC Press, 2009.
[3] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Mach. Learn., 2000.
[4] V. Gupta and G. S. Lehal, “A survey of text mining techniques and applications,” Journal of Emerging Technologies in Web Intelligence, 2009.
[5] C. C. Aggarwal and C. Zhai, Mining text data. 2013.
[6] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, “Machine learning: A review of classification and combining techniques,” Artif. Intell. Rev., 2006.
[7] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms: Second Edition. 2011.
[8] M. Ikonomakis, S. Kotsiantis, and V. Tampakas, “Text classification using machine learning techniques,” WSEAS Trans. Comput., 2005.
[9] X. Hu and H. Liu, “Text analytics in social media,” in Mining Text Data, 2012.
[10] M. F. Amasyali and B. Diri, “Automatic Turkish text categorization in terms of author, genre and gender,” Nat. Lang. Process. Inf. Syst. Proc., 2006.
[11] S. Al-Harbi, A. Almuhareb, A. Al-Thubaity, M. S. Khorsheed, and A. Al-Rajeh, “Automatic Arabic Text Classification,” Text, 2008.
[12] M. Scharkow, “Thematic content analysis using supervised machine learning: An empirical evaluation using German online news,” Qual. Quant., 2011.
[13] S. H. Lu, D. A. Chiang, H. C. Keh, and H. H. Huang, “Chinese text classification by the Naïve Bayes Classifier and the associative classifier with multiple confidence threshold values,” Knowledge-Based Syst., 2010.
[14] M. del Pilar Salas-Zarate, M. A. Paredes-Valverde, J. Limon, D. A. Tlapa, and Y. A. Báez, “Sentiment Classification of Spanish Reviews: An Approach based on Feature Selection and Machine Learning Methods,” J. UCS, vol. 22, no. 5, pp. 691–708, 2016.
[15] D. Torunoǧlu, E. Çakirman, M. C. Ganiz, S. Akyokuş, and M. Z. Gürbüz, “Analysis of preprocessing methods on classification of Turkish texts,” in INISTA 2011 - 2011 International Symposium on INnovations in Intelligent SysTems and Applications, 2011.
[16] A. K. Uysal and S. Gunal, “The impact of preprocessing on text classification,” Inf. Process. Manag., vol. 50, no. 1, pp. 104–112, Jan. 2014.
[17] M. F. Amasyali and T. Yildirim, “Automatic text categorization of news articles,” in Proceedings of the IEEE 12th Signal Processing and Communications Applications Conference, SIU 2004, 2004.
[18] “Hürriyet - Haberler, Son Dakika Haberleri ve Güncel Haber.” [Online]. Available: http://www.hurriyet.com.tr/. [Accessed: 12-Jul-2018].
[19] “NTV HABER - Haberler, Son Dakika Haberleri.” [Online]. Available: https://www.ntv.com.tr/. [Accessed: 15-Jul-2018].
[20] M. Porter, “Snowball: A language for stemming algorithms,” Snowball, 2001.
[21] Anjali, G. Jivani, and M. Anjali, “A Comparative Study of Stemming Algorithms,” October, 2007.
[22] Y. Bengio and Y. Grandvalet, “No unbiased estimator of the variance of k-fold cross-validation,” J. Mach. Learn. Res., 2004.
