Text Classification Experiments on Mongolian Language

Zolboo Damiran*, Khuder Altangerel†
*MUST-CSMS, Department of Information System, Ulaanbaatar, Mongolia
†MUST-CSMS, Department of Programming Technology, Ulaanbaatar, Mongolia
[email protected], [email protected]
Abstract - Text classification is an important task of assigning objects from a universe to two or more classes; in particular, it is used to classify the topic or the theme of a document. In this research, we describe the result of an experiment on text classification using Naive Bayesian method approaches for the Mongolian language. We have a training set of documents, each labeled with one of 10 classes. Since this is the first work in this field for the Mongolian language, no previous results were available for comparison.
Fig. 2. News Agency LLC
Fig. 3. Shuud.mn LLC
Keywords - Text categorization, classification, Naive Bayesian method, corpus
I. INTRODUCTION

Text classification is a problem applied to natural language texts: a document is assigned to one or more predefined categories based on its content. With the development of the Internet, web applications such as news portal websites and information search services are used widely. As a consequence, the information databases (that is, the news collections) on portal web sites are expanding rapidly. Nowadays it is a major challenge in information technology to quickly find the correct information that the user needs within this mass of information. In many contexts (News.mn, Gogo.mn, Shuud.mn), trained professionals are employed to categorize new items. This process is very time-consuming and costly, which limits its applicability. Consequently, there is increased interest in developing technologies for automatic text categorization. Nowadays, the most common text classification methods are the Bayes classification algorithm, decision trees, maximum entropy modeling, the perceptron learning algorithm, and k-nearest neighbor classification. In this paper, we describe the result of an experiment on text classification using Naive Bayesian approaches for the Mongolian language. We chose our classes from 3 web portals (shown in Fig. 1-3), which are commonly used in Mongolia.
Fig. 1. Mongol Content LLC
We chose to perform the detection on the 10 most commonly used topics on those web portals:

1. Education
2. Criminal
3. World
4. Social
5. Culture
6. Sport
7. Politic
8. Yellow
9. Economic
10. Health

We chose our training corpora based on the http://news.gogo.mn/ web portal. Using text classification on news topics, we can decide the following: which topic a news article belongs to, from a fixed list of topic areas such as "sports", "health" and "politics"; and which words give us the most informative features. Since it is the first work in this field for the Mongolian language, no previous results were available for comparison.

II. BAYESIAN TEXT CLASSIFICATION METHODS

The Naive Bayesian method is a simple probabilistic classifier based on Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions: it assumes that all the features are mutually independent.
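Written out, this independence assumption gives the standard multinomial Naive Bayes decision rule. The following is the textbook form of the model (our notation; this equation is not reproduced from the original paper), where a document d consists of words w_1, ..., w_n and C is the set of topics:

\[
P(c \mid d) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c \in C}\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
\]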
In Naive Bayes classifiers, every feature can help determine which topic should be assigned to a given input value. To choose a topic for an input value, the Naive Bayes classifier begins by calculating the prior probability of each topic, which is determined by checking the frequency of each topic in the training set. The contribution from each feature is then combined with this prior probability to arrive at a likelihood estimate for each topic. The topic whose likelihood estimate is highest is then assigned to the input value. A classifier is called supervised if it is built from training corpora containing the correct topic for each input. The framework used by Bayesian classification is shown in Fig. 4.
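As a rough illustration of this scoring step, the sketch below shows how the topic with the highest combined score can be selected. It is a minimal example under our own naming conventions (the priors and cond_prob tables and the classify function are our assumptions, not the authors' code), assuming a model has already been trained:

```python
import math

def classify(features, priors, cond_prob, topics):
    """Pick the topic with the highest Naive Bayes score.

    features      -- mapping from word to its count in the input document
    priors[c]     -- P(c): fraction of training documents labeled with topic c
    cond_prob[c]  -- mapping word -> P(word | c), estimated from the training set
    """
    best_topic, best_score = None, float("-inf")
    for c in topics:
        # Work in log space so that multiplying many small probabilities
        # does not underflow to zero.
        score = math.log(priors[c])
        for word, count in features.items():
            if word in cond_prob[c]:
                score += count * math.log(cond_prob[c][word])
        if score > best_score:
            best_topic, best_score = c, score
    return best_topic
```

Words never seen with a given topic are simply skipped here; with the add-one smoothing used in the training sketch later in this section, every in-vocabulary word has a nonzero probability for every topic.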
Fig. 4. Bayesian classification framework: (a) training, (b) prediction.
(a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and topics are fed into the machine learning algorithm to generate a model. (b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted topics.

Our Naive Bayesian algorithm:

TRAINMULTINOMIALNB(C, D)
    V ← ExtractVocabulary(D)
    N ← CountDocs(D)
    for each c in C do
        N_c ← CountDocsInClass(D, c)
        prior[c] ← N_c / N
        text_c ← ConcatenateTextOfAllDocsInClass(D, c)
        for each t in V do
            T_ct ← CountTokensOfTerm(text_c, t)
        for each t in V do
            condprob[t][c] ← (T_ct + 1) / Σ_t' (T_ct' + 1)
    return V, prior, condprob
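The training step translates almost line for line into Python. The sketch below is our own illustrative implementation, not code from the paper (the train_multinomial_nb name, whitespace tokenization, and variable names are assumptions); it uses add-one smoothing and returns the tables consumed by the classify function sketched earlier:

```python
from collections import Counter, defaultdict

def train_multinomial_nb(train_set, topics):
    """Estimate priors P(c) and smoothed conditionals P(w | c).

    train_set -- list of (article_text, topic) pairs
    topics    -- list of topic labels, e.g. the 10 classes used in this work
    """
    vocab = set()
    docs_in_class = Counter()            # N_c: number of training docs per topic
    term_counts = defaultdict(Counter)   # T_ct: word counts per topic

    for text, topic in train_set:
        docs_in_class[topic] += 1
        counts = Counter(text.lower().split())   # bag-of-words features
        vocab.update(counts)
        term_counts[topic].update(counts)

    n_docs = len(train_set)
    priors, cond_prob = {}, {}
    for c in topics:
        priors[c] = docs_in_class[c] / n_docs
        denom = sum(term_counts[c].values()) + len(vocab)   # add-one smoothing
        cond_prob[c] = {w: (term_counts[c][w] + 1) / denom for w in vocab}
    return vocab, priors, cond_prob
```

An unseen article can then be classified with something like classify(Counter(article.lower().split()), priors, cond_prob, topics), mirroring the prediction path of Fig. 4.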