A Survey On Text Categorization Of Indian And Non-Indian Languages Using Supervised Learning Techniques

Khyati S. Kava
Faculty of Technology, Dharmsinh Desai University, Nadiad, India

Prof. Nikita P. Desai
Faculty of Technology, Dharmsinh Desai University, Nadiad, India

Abstract—Text categorization plays an important role in the field of text mining. It is the process of assigning documents to predefined categories. Automatic text categorization is an important task because of the large number of electronic documents. This paper presents a survey of text categorization for Indian and non-Indian languages. Very little work has been done on text categorization of Indian languages. To extract document features, the TF-IDF (term frequency-inverse document frequency) method is most commonly used. Major classifiers such as SVM (Support Vector Machine), NB (Naïve Bayes), decision tree and K-NN (K-nearest neighbor) are used for the text categorization process. The measures used to evaluate the performance of text categorization are recall, precision and F-measure.

Keywords-Text Categorization, TF-IDF, SVM, NB.

I. INTRODUCTION
Categorization is the process in which objects and ideas are recognized, differentiated and understood, and then grouped into categories [22]. Text categorization deals with large amounts of electronic data such as blogs, newspapers, emails and online books. Managing electronic data is a challenging task because the volume grows day by day, and manually categorizing millions of documents is time consuming and expensive. Automatic text categorization is therefore an important task [13]. It has many applications, including spam filtering, text filtering, email routing, language identification, news classification, contextual search and genre classification. A great deal of work has been done on text categorization of non-Indian languages, but very little attention has been paid to Indian languages. Indian languages are morphologically rich: they have a very large number of word forms, and hence very large feature spaces, and they also require a set of pre-processing steps [9].

This paper is organized as follows. Section II: Background Theory, Section III: Literature Survey, Section IV: Issues in Indian languages of text categorization, Section V: Conclusion.

II. BACKGROUND THEORY
There are mainly two approaches to categorization: the rule-based approach and the machine learning approach. In the rule-based approach, classification rules are developed manually and documents are classified according to those rules. In the machine learning approach, labeled sample documents are used to define the classification equations automatically [12]. The text categorization work surveyed here is based on machine learning methods. These methods require a training set and a test set. The collection of documents is divided into these two sets: the training set is a pre-classified set of documents used to train the classifier, and the test set is used to determine the accuracy of the classifier by checking whether its predictions are correct or incorrect [12]. The documents in the training set are tagged manually by experts, and the performance of the system depends on a good training set. The machine learning approach to text categorization is based on the bag-of-words representation. Using concepts for text categorization increases overall performance, particularly when categorizing a domain-specific corpus [15].

A. Text categorization process
Text categorization is the process of automatically sorting a set of documents into predefined categories. The architecture of text categorization is shown in Figure 1. The process consists mainly of data pre-processing, feature extraction, classifier construction and performance evaluation.

1) Data pre-processing
The pre-processing phase includes word tokenization, stop word elimination and stemming.

a) Tokenization
Tokenization is the process in which a document is divided into small units called tokens.

b) Stop word elimination
A stop list is a list of unwanted words, such as prepositions, conjunctions and pronouns, which are repeated in nearly every text document [12].
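The two steps above can be illustrated with a minimal Python sketch. The stop list and the example sentence are hypothetical and chosen only for illustration; the surveyed papers use language-specific stop lists. The regular expression is also an assumption that suits Latin-script text: in Python's `re` module, `\w` does not match combining marks, so scripts such as Devanagari would need a script-aware tokenizer.

```python
import re

# Hypothetical stop list for illustration only; a real system would use a
# curated, language-specific list.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to", "it", "on"}


def tokenize(text):
    """Split a document into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


tokens = tokenize("The match is on in the stadium, and it is loud.")
print(remove_stop_words(tokens))  # ['match', 'stadium', 'loud']
```

The remaining tokens are what the later feature extraction step would operate on.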

4) Performance evaluation
The effectiveness of a text categorization system can be evaluated in terms of its precision (p), recall (r) and F-measure. Precision is the percentage of correctly classified documents among all documents that the classifier assigned to the category, i.e. a measure of exactness. Recall for a category is the percentage of correctly classified documents among all documents belonging to that category, i.e. a measure of completeness. The F-measure combines recall and precision [10].
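As a hedged illustration of these measures, the sketch below computes precision, recall and F-measure for a single category from lists of true and predicted labels. The text only states that the F-measure combines the two values; the harmonic mean used here (often called F1) is an assumption, and the function name and labels are hypothetical.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Return (precision, recall, f1) for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    # Documents correctly assigned to the category
    tp = sum(1 for t, p in pairs if t == category and p == category)
    # Documents wrongly assigned to the category
    fp = sum(1 for t, p in pairs if t != category and p == category)
    # Documents of the category that the classifier missed
    fn = sum(1 for t, p in pairs if t == category and p != category)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Assumption: F-measure taken as the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


true = ["sports", "sports", "sports", "politics", "politics"]
pred = ["sports", "sports", "politics", "sports", "politics"]
print(precision_recall_f1(true, pred, "sports"))
```

In this example two of the three documents assigned to "sports" are correct and two of the three true "sports" documents are found, so precision and recall are both 2/3.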

Figure 1 Architecture of Text Categorization [2]

c) Stemming
Stemming is the removal of affixes (prefixes and suffixes) from words to reduce them to their root form.

2) Feature extraction
In text categorization, features are extracted using TF-IDF (term frequency-inverse document frequency). The term frequency $\mathrm{TF}(w_i, D)$ is the number of times word $w_i$ occurs in document $D$. The document frequency $\mathrm{DF}(w_i)$ is the number of documents in which word $w_i$ occurs at least once. The inverse document frequency is the logarithm of the ratio of the total number of documents to the number of documents in which the term occurs:

$$\mathrm{IDF}(w_i) = \log \frac{N}{n}$$

where $N$ is the total number of documents and $n = \mathrm{DF}(w_i)$ is the number of documents that contain the term. The TF-IDF weight of a term in a document is the product of its TF and IDF values.
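The TF-IDF definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the weighting scheme of any particular surveyed paper: the function name `tf_idf` and the toy documents are hypothetical, the raw count is used as TF, and the natural logarithm is an assumption because the text does not state the base.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)

    # DF(w): number of documents in which w occurs at least once
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))

    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # TF(w, D): raw count of w in D
        # Weight = TF * log(N / n); natural log is an assumption
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


docs = [
    ["cricket", "match", "cricket"],
    ["election", "match"],
    ["cricket", "election", "vote"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document gets weight zero under this definition, since log(N/N) = 0, which is why very common words contribute little to the feature vector.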

3) Classifiers
The major classifiers used in the text categorization literature include Naïve Bayes, Support Vector Machines, decision trees, K-nearest neighbor and neural networks. According to the surveyed literature, the Support Vector Machine (SVM) performs better than the other classifiers because it handles relatively high-dimensional and large-scale data sets efficiently without a loss of classification accuracy. K-nearest neighbor (K-NN) is simple and effective but not efficient on high-dimensional, large-scale data sets; it makes its prediction from the "k" training documents that are closest to the test document. The Naïve Bayes (NB) classifier assumes that the terms in a document are independent of one another, even though this is not the case in the real world [12].
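To make the independence assumption of Naïve Bayes concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier. It is not the implementation used in any surveyed paper: the class name, the toy training data and the add-one (Laplace) smoothing are assumptions introduced for illustration. The independence assumption appears as the sum of per-term log probabilities in `predict`.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over token lists (illustrative sketch)."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per category
        self.term_counts = defaultdict(Counter)   # term counts per category
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, tokens):
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior of the category
            score = math.log(doc_count / self.total_docs)
            total_terms = sum(self.term_counts[label].values())
            for t in tokens:
                if t not in self.vocabulary:
                    continue  # ignore terms never seen in training
                # Independence assumption: each term contributes separately.
                # Add-one smoothing (an assumption) avoids log(0).
                count = self.term_counts[label][t]
                score += math.log((count + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    ["cricket", "match", "score"],
    ["team", "wins", "match"],
    ["election", "vote", "party"],
    ["party", "wins", "election"],
]
train_labels = ["sports", "sports", "politics", "politics"]

clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(clf.predict(["cricket", "team", "match"]))
print(clf.predict(["vote", "election"]))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters for the large feature spaces mentioned for morphologically rich languages.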

III. LITERATURE SURVEY

A. Text categorization of non-Indian languages
Meryeme Hadni et al. [2] proposed an Arabic stemmer based on a hybrid approach. They used three stemming approaches for pre-processing: root-based, stem-based and statistical. TF-IDF was used for feature selection. The categories were education, science and tourist, and the dataset was a large Arabic corpus. The classifiers were SVM and Naïve Bayes, and accuracy was measured with precision, recall and F-measure. With the Naïve Bayes classifier they reported 71.16% for the Hybrid Stemmer, 66% for the Light Stemmer and 50.33% for N-Gram. With SVM they reported 63.2% for the Light Stemmer, 51% for N-Gram and 63.2% for the Khoja Stemmer.

Saleh Alsaleem [4] proposed automated Arabic text categorization. The corpus was taken from online Arabic newspapers, and keywords (bag of words) were used as features. The classifiers were SVM and Naïve Bayes, with accuracies of 77% for SVM and 74% for Naïve Bayes.

M. F. Amasyalı and B. Diri [6] proposed automatic Turkish text categorization in terms of author, genre and gender. They used online Turkish news as the corpus and SVM, Naïve Bayes and K-NN as classifiers. Turkish documents were classified according to author, genre and gender, with reported accuracies of 83% for author, 93% for genre and 96% for gender.

Ji He et al. [8] presented a study of Chinese text categorization methods. They created their own Chinese corpus, used K-NN, SVM and ARAM as classifiers, and used bag of words as features.

M. Ikonomakis et al. [7] presented "Text Classification Using Machine Learning Techniques". They used bag-of-words features and online newspapers as the corpus. The classifiers were K-NN, SVM and Naïve Bayes.

"Text Classification using Keyword Extraction Technique" was proposed by Menaka, S. Radha [1]. They reported accuracies of 95% for K-NN, 87% for Naïve Bayes and 98% for decision tree. Bag of words was used as the feature representation.

"Automatic Text Classification: A Technical Review" was presented by Mita K. Dalal and Mukesh A. Zaveri [5]. They used bag-of-words features and online newspapers as the corpus. The classifiers were SVM, decision tree, Naïve Bayes and neural network.

Pratiksha Y. Pawar and S. H. Gawande [3] presented a comparative study of different approaches to text categorization. The corpora were 20Newsgroup and Reuters 21578. The classifiers compared were K-NN, Rocchio's algorithm, decision tree, Naïve Bayes, back-propagation network and Support Vector Machines (SVM). On the 20Newsgroup corpus the reported average results are K-NN 81.04%, Rocchio's 79.10%, Naïve Bayes 83% and SVM 86.10%. On the Reuters 21578 corpus the reported results are K-NN 79.07%, L Square 94.57%, SVM (polynomial) 86.20%, SVM (RBF) 86.50%, K-NN 80.03%, decision tree 87.09% and SVM 91.05%.

El Kourdi, Mohamed et al. [14] proposed automatic Arabic document categorization based on the Naïve Bayes algorithm. They performed automatic classification of Arabic web documents, using TF-IDF for feature extraction and Naïve Bayes as the classifier. They considered 300 documents under the categories health, sport, business, science and culture.

B. Text categorization of Indian languages
M Narayana Swamy et al. [10] proposed a text categorization technique. They classified documents in three Indian languages: Kannada, Telugu and Tamil. TF-IDF was used for features, and the database contained 300 documents. The reported accuracies are Naïve Bayes 97.66%, K-nearest neighbor 93% and decision tree 97.33%.

A. R. Ali and M. Ijaz [11] proposed Urdu text classification. They used a large text corpus of 19.3 million words collected from different online Urdu news websites, with bag of words for feature extraction. The categories were News, Sports, Finance, Culture, Consumer information and Personal communication.

K Raghuveer et al. [9] studied text categorization approaches for many Indian languages, including Assamese, Bengali, Hindi, Kannada, Oriya, Gujarati, Punjabi and Telugu. They used TF-IDF as the feature vector and 262 documents from the DoE-CIIL corpora. The three classifiers were K-NN, Naïve Bayes and SVM. Performance was evaluated across all languages, with reported precision of 86.70%, recall of 53.29% and F-measure of 66.01%.

Nidhi, Vishal Gupta [13] worked on Punjabi text documents. They used TF-IDF for feature selection and Punjabi newspapers as the corpus. The classifiers described were Naïve Bayes and centroid-based classifiers.

Vishnu Murthy G. et al. [12] proposed automated Telugu text categorization with effective classifiers. The categories were Economics, Politics, Science, Sports, Culture and Health. They created their own Telugu corpus and used TF-IDF features. The classifiers were Naïve Bayes, K-NN and SVM.

Table 1. Literature review of text categorization

C. Non-Indian languages

| Sr. no. | Paper name and year | Publication details and author names | Feature selection method and target language | Classifier and average accuracy | Corpus source and size | Characteristics |
|---|---|---|---|---|---|---|
| 1 | Effective Arabic stemmer based hybrid approach for Arabic text categorization (2013) | International Journal of Data Mining & Knowledge Management Process (Meryeme Hadni et al.) | TF-IDF; Arabic | SVM, Naïve Bayes | Own corpus | Three stemming algorithms are used for pre-processing: root-based, stem-based and statistical. Categories are education, science and tourist. |
| 2 | Automated Arabic Text Categorization Using SVM and NB (2011) | International Arab Journal of e-Technology (Saleh Alsaleem) | Keywords; Arabic | SVM – 77%, Naïve Bayes – 74% | Online newspapers | Keywords are used as features. Classifiers are SVM and Naïve Bayes. Category types are not defined. |
| 3 | Automatic Turkish Text Categorization in Terms of Author, Genre and Gender (2006) | Springer (M. F. Amasyalı and B. Diri) | Keywords; Turkish | SVM, Naïve Bayes, K-NN, C4.5. Accuracy by category: Author – 83%, Genre – 93%, Gender – 96% | Online Turkish newspaper | Turkish documents are classified according to author, genre and gender. |
| 4 | A comparative study on Chinese text categorization method (2000) | (Ji He, Ah-Hwee Tan, Chew Lim Tan) | Keywords; Chinese | K-NN, SVM, ARAM | Own corpus | Category types are not defined. |
| 5 | Text Classification Using Machine Learning Techniques (2005) | (M. Ikonomakis, S. Kotsiantis, V. Tampakas) | Bag of words | K-NN, SVM, Naïve Bayes | Online newspapers | Bag-of-words features; online newspapers as corpus; classifiers K-NN, SVM and Naïve Bayes. Categories are not mentioned. |
| 6 | Text Classification using Keyword Extraction Technique (2013) | International Journal of Advanced Research in Computer Science and Software Engineering (Menaka, S. Radha) | Keywords | K-NN – 95%, Naïve Bayes – 87%, Decision tree – 98% | Not mentioned | |
| 7 | Automatic Text Classification: A Technical Review (2011) | International Journal of Computer Applications (Mita K. Dalal, Mukesh A. Zaveri) | Bag of words | Neural network, SVM, Naïve Bayes, Decision tree | Online newspapers | Categories are not mentioned. |
| 8 | A Comparative Study on Different Types of Approaches to Text Categorization (2012) | International Journal of Machine Learning and Computing (Pratiksha Y. Pawar and S. H. Gawande) | Bag of words | Using 20Newsgroup corpus: K-NN – 81.04%, Rocchio's – 79.10%, Decision tree, Naïve Bayes – 83%, SVM – 86.10% | 20 Newsgroup and Reuters 21578 | Different machine learning techniques are compared. |
| 9 | Automatic Arabic document categorization based on the Naïve Bayes algorithm (2004) | (El Kourdi, Mohamed, Amine Bensaid, and Tajje-eddine Rachidi) | TF-IDF; Arabic | Naïve Bayes – 90% | Online newspaper; 300 documents for each category | Categories are sport, business, health, science and culture. |

D. Indian languages

| Sr. no. | Paper name and year | Publication details and author names | Feature selection method and target language | Classifier and average accuracy | Corpus source and size | Characteristics |
|---|---|---|---|---|---|---|
| 10 | Indian Language Text Representation and Categorization Using Supervised Learning Algorithm (2013) | International Journal of Data Mining Techniques and Applications (M Narayana Swamy, M. Hanumanthappa) | Bag of words; Kannada, Telugu, Tamil | K-NN – 93%, Decision tree – 97.33%, Naïve Bayes – 97.66% | Self-created; 300 documents | Documents are classified for the Kannada, Tamil and Telugu languages. |
| 11 | A Comparative study on term weighting methods for Automated Telugu text categorization with effective classifiers (2013) | International Journal of Data Mining & Knowledge Management Process (Vishnu Murthy G. et al.) | Bag of words; Telugu | SVM, Naïve Bayes, K-NN; Average – 82.7% | Telugu newspaper; 800 news articles | Domains are economics, politics, science, sports, culture and health. |
| 12 | Urdu text classification (2009) | International Journal (A. R. Ali and M. Ijaz) | Keywords; Urdu | SVM, Naïve Bayes | Large text corpus of 19.3 million words, collected from different online Urdu news websites | Categories are News, Sports, Finance, Culture, Consumer information and Personal communication. |
| 13 | Text Categorization in Indian Languages using Machine Learning Approaches (2007) | (K Raghuveer and Kavi Narayana Murthy) | TF-IDF; Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil and Telugu | K-NN, Naïve Bayes, SVM; Average – 60.30% | DoE-CIIL corpora; 262 documents | Categories are Aesthetics, Social Sciences, Natural Sciences, Translated Material, Official and Media Language, and Commerce. |
| 14 | Domain Based Classification of Punjabi Text Documents using Ontology and Hybrid Based Approach (2012) | (Nidhi, Vishal Gupta) | TF-IDF; Punjabi | Naïve Bayes, Centroid based | Punjabi newspapers | Categories are not mentioned. |

IV. ISSUES IN INDIAN LANGUAGES OF TEXT CATEGORIZATION
Indian languages are morphologically rich and agglutinative in nature, and their words are inflected with various grammatical features. As a result, not much work has been done on text categorization of Indian languages. Another challenge is the scarcity of resources for Indian languages. In addition, few efficient methods or tools are available for pre-processing steps such as stop word removal and stemming.

V. CONCLUSION
In this paper, we have described text categorization of Indian and non-Indian languages together with the reported performance. Not much work has been done on text categorization of Indian languages. It is a challenging task because few resources are available and because these languages are rich in morphology, which gives rise to a very large number of word forms. From the survey we conclude that TF-IDF, the most common term weighting method, is the method mainly used for feature extraction in text categorization, and that the major classifiers used are SVM and Naïve Bayes. Our future work is to perform text categorization of Gujarati documents; Gujarati is one of the major Indian languages.

References
[1] Menaka, S. Radha, "Text Classification using Keyword Extraction Technique," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 12, pp. 734–740, 2013.
[2] Hadni, Meryeme, Said Alaoui Ouatik, and Abdelmonaime Lachkar. "Effective Arabic stemmer based hybrid approach for Arabic text categorization." International Journal of Data Mining & Knowledge Management Process 3.4 (2013).
[3] Pawar, Pratiksha Y., and S. H. Gawande. "A comparative study on different types of approaches to text categorization." International Journal of Machine Learning and Computing 2.4 (2012): 423-426.
[4] Alsaleem, Saleh. "Automated Arabic Text Categorization Using SVM and NB." International Arab Journal of e-Technology 2.2 (2011): 124-128.
[5] Dalal, Mita K., and Mukesh A. Zaveri. "Automatic text classification: A technical review." International Journal of Computer Applications 28.2 (2011): 37-40.
[6] Amasyalı, M. Fatih, and Banu Diri. "Automatic Turkish text categorization in terms of author, genre and gender." Natural Language Processing and Information Systems. Springer Berlin Heidelberg, 2006. 221-226.
[7] Ikonomakis, M., S. Kotsiantis, and V. Tampakas. "Text classification using machine learning techniques." WSEAS Transactions on Computers 4.8 (2005): 966-974.
[8] He, Ji, Ah-Hwee Tan, and Chew Lim Tan. "A Comparative Study on Chinese Text Categorization Methods." PRICAI Workshop on Text and Web Mining. Vol. 35. 2000.
[9] Raghuveer, K., and Kavi Narayana Murthy. "Text Categorization in Indian Languages using Machine Learning Approaches." IICAI. 2007.
[10] Swamy, M. Narayana, and M. Hanumanthappa. "Indian Language Text Representation and Categorization Using Supervised Learning Algorithm." International Journal of Data Mining Techniques and Applications, Vol. 02, December 2013, pp. 251-257.
[11] Ali, Abbas Raza, and Maliha Ijaz. "Urdu text classification." Proceedings of the 7th International Conference on Frontiers of Information Technology. ACM, 2009.
[12] Pal Reddy, P. Vijay. "A comparative study on term weighting methods for automated Telugu text categorization with effective classifiers." International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 3, No. 6, November 2013.
[13] Nidhi, Vishal Gupta. "Domain Based Classification of Punjabi Text Documents using Ontology and Hybrid Based Approach." Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing, COLING. 2012.
[14] El Kourdi, Mohamed, Amine Bensaid, and Tajje-eddine Rachidi. "Automatic Arabic document categorization based on the Naïve Bayes algorithm." Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. Association for Computational Linguistics, 2004.
[15] Pushpa, M., and K. Nirmala. "Text Categorization Using Activation Based Term Set." International Journal of Computer Science Issues (IJCSI) 9.4 (2012).
[16] C. C. Aggarwal and C. Zhai, "A Survey Of Text Classification Algorithms".
[17] C. Chan, "Automated Online News Classification with Personalization," pp. 1–10, 2001.
[18] D. Ramdass and S. Seshasai, "Document Classification for Newspaper Articles," pp. 1–12, 2009.
K. Nirmala, "Feature based Text Classification using Application Term Set," vol. 52, no. 10, pp. 1–3, 2012.
[19] W. B. Cavnar and J. M. Trenkle, "N-Gram-Based Text Categorization," 1994.
[20] Niharika, S., V. Sneha Latha, and D. R. Lavanya. "A Survey on Text Categorization." International Journal of Computer Trends and Technology, volume 3, issue 1 (2012).
[21] Soucy, Pascal, and Guy W. Mineau. "Beyond TFIDF weighting for text categorization in the vector space model." IJCAI. Vol. 5. 2005.
[22] http://en.wikipedia.org/wiki/Categorization#cite_note-1.
