Text Classification Experiments on Mongolian Language

Zolboo Damiran*, Khuder Altangerel†
*MUST-CSMS, Department of Information System, Ulaanbaatar, Mongolia
†MUST-CSMS, Department of Programming Technology, Ulaanbaatar, Mongolia
[email protected], [email protected]
Abstract - Text classification is an important task of assigning objects from a universe to two or more classes; in particular, it is used to classify the topic or the theme of a document. In this research, we describe the result of an experiment on text classification using Naive Bayesian method approaches for the Mongolian language. We have a training set of documents, each labeled with one of 10 classes. Since this is the first work in this field for the Mongolian language, no previous results were available for comparison.
Fig. 2. News Agency LLC
Fig. 3. Shuud.mn LLC
Keywords - Text categorization, classification, Naive Bayesian method, corpus
I. INTRODUCTION

Text classification is a problem applied to natural language texts: a document is assigned to one or more predefined categories based on its content. With the development of the Internet, web applications such as news portal websites and information search services are used widely. As a consequence, the information databases (that is, the news collections) on portal web sites are expanding rapidly. Nowadays it is a major challenge in information technology to quickly find the correct information that the user needs within this mass of information. In many contexts (News.mn, Gogo.mn, Shuud.mn), trained professionals are employed to categorize new items. This process is very time-consuming and costly, which limits its applicability. Consequently, there is increased interest in developing technologies for automatic text categorization. Nowadays, the most common text classification methods are the Bayes classification algorithm, decision trees, maximum entropy modeling, the perceptron learning algorithm, and k-nearest neighbor classification. In this paper, we describe the result of an experiment on text classification using Naive Bayesian approaches for the Mongolian language. We chose our classes from 3 web portals (shown in Fig. 1-3), which are commonly used in Mongolia.
Fig. 1. Mongol Content LLC
We chose to perform the detection on the 10 most commonly used topics on those web portals:

1. Education
2. Criminal
3. World
4. Social
5. Culture
6. Sport
7. Politic
8. Yellow
9. Economic
10. Health

We chose our training corpora based on the http://news.gogo.mn/ web portal. Using text classification on news topics, we can decide the following: which topic a news article belongs to, from a fixed list of topic areas such as "sports", "health" and "politics"; and which words give us the most informative features. Since it is the first work in this field for the Mongolian language, no previous results were available for comparison.

II. BAYESIAN TEXT CLASSIFICATION METHODS

The Naive Bayesian method is a simple probabilistic classifier based on Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions: it assumes that all the features are mutually independent.
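Written out, this independence assumption gives the standard multinomial Naive Bayes decision rule. The following is the textbook form of the model (our notation; this equation is not reproduced from the original paper), where a document d consists of words w_1, ..., w_n and C is the set of topics:

\[
P(c \mid d) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c \in C}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
\]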
In Naive Bayes classifiers, every feature can help determine which topic should be assigned to a given input value. To choose a topic for an input value, the Naive Bayes classifier begins by calculating the prior probability of each topic, which is determined by checking the frequency of each topic in the training set. The contribution from each feature is then combined with this prior probability to arrive at a likelihood estimate for each topic. The topic whose likelihood estimate is highest is then assigned to the input value. A classifier is called supervised if it is built from training corpora containing the correct topic for each input. The framework used by Bayesian classification is shown in Fig. 4.
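As a rough illustration of this scoring step, the sketch below shows how the topic with the highest combined score can be selected. It is a minimal example under our own naming conventions (the priors and cond_prob tables and the classify function are our assumptions, not the authors' code), assuming a model has already been trained:

```python
import math

def classify(features, priors, cond_prob, topics):
    """Pick the topic with the highest Naive Bayes score.

    features      -- mapping from word to its count in the input document
    priors[c]     -- P(c): fraction of training documents labeled with topic c
    cond_prob[c]  -- mapping word -> P(word | c), estimated from the training set
    """
    best_topic, best_score = None, float("-inf")
    for c in topics:
        # Work in log space so that multiplying many small probabilities
        # does not underflow to zero.
        score = math.log(priors[c])
        for word, count in features.items():
            if word in cond_prob[c]:
                score += count * math.log(cond_prob[c][word])
        if score > best_score:
            best_topic, best_score = c, score
    return best_topic
```

Words never seen with a given topic are simply skipped here; with the add-one smoothing used in the training sketch later in this section, every in-vocabulary word has a nonzero probability for every topic.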
Fig. 4. Bayesian classification framework: (a) training, (b) prediction.
(a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and topics are fed into the machine learning algorithm to generate a model. (b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted topics.

Our Naive Bayesian algorithm:

TRAINMULTINOMIALNB(C, D)
    V ← ExtractVocabulary(D)
    N ← CountDocs(D)
    for each c in C do
        N_c ← CountDocsInClass(D, c)
        prior[c] ← N_c / N
        text_c ← ConcatenateTextOfAllDocsInClass(D, c)
        for each t in V do
            T_ct ← CountTokensOfTerm(text_c, t)
        for each t in V do
            condprob[t][c] ← (T_ct + 1) / Σ_t' (T_ct' + 1)
    return V, prior, condprob
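The training step translates almost line for line into Python. The sketch below is our own illustrative implementation, not code from the paper (the train_multinomial_nb name, whitespace tokenization, and variable names are assumptions); it uses add-one smoothing and returns the tables consumed by the classify function sketched earlier:

```python
from collections import Counter, defaultdict

def train_multinomial_nb(train_set, topics):
    """Estimate priors P(c) and smoothed conditionals P(w | c).

    train_set -- list of (article_text, topic) pairs
    topics    -- list of topic labels, e.g. the 10 classes used in this work
    """
    vocab = set()
    docs_in_class = Counter()            # N_c: number of training docs per topic
    term_counts = defaultdict(Counter)   # T_ct: word counts per topic

    for text, topic in train_set:
        docs_in_class[topic] += 1
        counts = Counter(text.lower().split())   # bag-of-words features
        vocab.update(counts)
        term_counts[topic].update(counts)

    n_docs = len(train_set)
    priors, cond_prob = {}, {}
    for c in topics:
        priors[c] = docs_in_class[c] / n_docs
        denom = sum(term_counts[c].values()) + len(vocab)   # add-one smoothing
        cond_prob[c] = {w: (term_counts[c][w] + 1) / denom for w in vocab}
    return vocab, priors, cond_prob
```

An unseen article can then be classified with something like classify(Counter(article.lower().split()), priors, cond_prob, topics), mirroring the prediction path of Fig. 4.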