XXIX International Conference of the Chilean Computer Science Society
Automated Text Binary Classification Using a Machine Learning Approach

Alberto Holts, Claudio Riquelme, Rodrigo Alfaro
Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Chile.
Rodrigo Alfaro is also with Departamento de Informática, Universidad Técnica Federico Santa María, Valparaíso, Chile.
Email: [email protected], [email protected], [email protected]

1522-4902/10 $26.00 © 2010 IEEE. DOI 10.1109/SCCC.2010.30

Abstract—The growing number of documents in digital format available on the Web, and the useful information they contain for different purposes, creates an essential need to organize them. This task must be automated in order to save cost and manpower. In the research community, the main approach to this problem is based on the application of machine learning techniques. This article studies the main machine learning approaches to automated text classification.

Keywords—Text classification, Machine learning.

I. INTRODUCTION

Large amounts of text in digital format available on the Web contain useful information for different purposes. In the coming years, the amount of data and of applications based on data analysis is expected to increase significantly; classifying these documents is therefore necessary in order to access them in an easy and organized manner. This task can be performed manually by human experts, but at a high cost, which is why it must be performed automatically. The objective of this study is to address the automated categorization of text documents written in Spanish and English, using the most popular approach in the literature: the machine learning approach.

II. TEXT CLASSIFICATION

A. Definition

The task of text classification (TC) consists of assigning a Boolean value to each pair ⟨d_j, c_i⟩ ∈ D × C, where D is the domain of documents and C = {c_1, ..., c_|C|} is a set of predefined categories. A true (T) value assigned to a pair indicates a decision to classify document d_j under category c_i, while a false (F) value indicates a decision not to file d_j under c_i. The goal is therefore to approximate, as closely as possible, the unknown target function that describes how documents ought to be classified, by means of a function called the classifier. [1] presents a taxonomy of automatic classification problems based on classes, labels, and type of classification (hard or soft):

- Classes (binary or multi-class): Binary classification is the simplest and most widely studied case, in which a document is classified into one of two mutually exclusive categories. Binary classification can be extended to solve multi-class problems.
- Labels (single-label or multi-label): Whether a document can be assigned a single label or several labels at once.
- Type (hard or ranking): In hard classification, a decision (belonging or not to a class) must be found for every pair ⟨d_j, c_i⟩ ∈ D × C. Sometimes, however, it is necessary to produce a ranking of categories for a document, or a ranking of documents for a category.

Regardless of how the problem is posed and which algorithms are available to solve it, according to [2] the text classification task presents several inherent difficulties:

- High-dimensional feature space: regardless of the particular term selection method, text classification problems involve a high-dimensional feature space.
- Heterogeneous use of terms: similar information can be expressed using different terms.
- High level of redundancy: although many features are of interest for the classification task, several of them appear repeatedly within a single document.
- Frequency distribution of words: the occurrence of words in natural language follows Zipf's law, which indicates that a small number of words occur very often, whereas the majority of words occur very infrequently.

To achieve automated text categorization, two main steps are required: first, representing the documents so that they can be processed by the learning algorithm, and second, automatically constructing a classifier by an inductive process [3].
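The Zipf-like frequency distribution mentioned above is easy to observe empirically. The following sketch (on a toy corpus of our own, so the numbers are only illustrative) counts word occurrences and exposes the heavy-tailed rank/frequency profile:

```python
from collections import Counter

def rank_frequencies(text):
    """Count word occurrences and return frequencies sorted from most to least common."""
    words = text.lower().split()
    counts = Counter(words)
    return [freq for _, freq in counts.most_common()]

# Toy corpus: a few words dominate, most appear only once (Zipf-like profile).
corpus = ("the cat sat on the mat the dog saw the cat "
          "a bird flew over the quiet green mat")
freqs = rank_frequencies(corpus)
print(freqs[0], freqs[-1])  # highest-rank vs lowest-rank frequency
```

On real corpora the same shape appears at scale: a handful of function words account for a large share of all tokens.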
B. Representation

Document representation has a high impact on the classification task [4]. Even texts that are already stored in machine-readable form are not, in general, immediately suitable for most learning algorithms: they have to be turned into a representation appropriate for both the learning algorithm and the classification task. A further issue when working with natural language is that context has a significant influence on the meaning of a piece of text. Different text representation approaches recognize or ignore these dependencies to a varying extent, and can be organized according to the level at which they analyze text [5]:

- Sub-word level: decomposition of words and their morphology. A popular technique is the use of N-grams, which splits text into strings of n consecutive characters; this representation has given good results in the presence of typos.
- Word level (single word): words and lexical information. Its main exponent is the so-called bag-of-words [6].
- Multi-word level: phrases and syntactic information. Phrases can be built by statistical co-occurrence methods or by syntactic methods, where linguistic information about words (verbs, adjectives, or nouns) is used to form phrases.
- Semantic level: the meaning of the text.
- Pragmatic level: the meaning of the text with respect to context and situation.
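The sub-word and word levels above can be sketched in a few lines (a minimal sketch; the sliding-window convention without padding is one of several possible choices):

```python
def char_ngrams(text, n=3):
    """Sub-word level: split text into strings of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_words(text):
    """Word level: the multiset of tokens in the text (bag-of-words)."""
    bag = {}
    for token in text.lower().split():
        bag[token] = bag.get(token, 0) + 1
    return bag

print(char_ngrams("valor", n=3))           # ['val', 'alo', 'lor']
print(bag_of_words("the cat saw the cat"))  # {'the': 2, 'cat': 2, 'saw': 1}
```

Because N-grams overlap, a single typo corrupts only the few windows that cover it, which is why this representation tolerates noisy text well.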
A particular way of representing documents at the different levels mentioned above is the vector space model, where each document is represented by a vector of terms. Each term is a feature that describes the document; it can be a string of size N, a single word, a phrase, or a concept. The vector space model is one of the most widely used models in ad-hoc information retrieval, mainly because of its conceptual simplicity and the appeal of the underlying metaphor of using spatial proximity for semantic proximity [7]. To weight the terms of the vector space model, the frequency of occurrence of a word in a document could be used directly, but more effective term-weighting methods exist; the basic information they use is term frequency, document frequency, and sometimes collection frequency. [8] combines different mappings of text to input space with different kernel functions in Support Vector Machines. Under the vector space model, a document is represented as a vector d = (w_1, ..., w_k), where k is the size of the set of words (features) that compose it, and the value (weight) of each feature indicates how much the term contributes to the document. Of the many term-weighting methods, we considered five, four of which are borrowed from the Information Retrieval field [6]:

- Binary: a binary value indicating the absence or presence of term t_j in document D_i:

  w(D_i, t_j) = 1 if f_ij > 0, and 0 otherwise.   (1)

- Frequency: the absolute frequency of term t_j in document D_i:

  w(D_i, t_j) = f_ij   (2)

- tf: the frequency f_ij normalized by the size of the document:

  w(D_i, t_j) = f_ij / |D_i|   (3)

- tf.idf: the normalized frequency (tf) multiplied by the inverse document frequency of the term in the collection of N documents:

  w(D_i, t_j) = (f_ij / |D_i|) · (−log(n_j / N))   (4)

  where n_j is the number of documents that contain term t_j.
Finally, we considered a recently proposed term-weighting scheme [9]:

- tf.rf: the normalized frequency (tf) multiplied by the relevance frequency of the term in the collection:

  w(D_i, t_j) = (f_ij / |D_i|) · log(2 + b/c)   (5)

  where b is the number of documents in the positive category containing term t_j, and c is the number of documents in the negative category containing term t_j.

C. Classification

Once the documents are represented, the set of previously classified documents, called the initial corpus, is defined as Ω = {d_1, ..., d_|Ω|}, classified under the same set of categories C = {c_1, ..., c_|C|} with which the system will need to operate. This means that the values of the total function D × C → {T, F} are known for every pair ⟨d_j, c_i⟩ ∈ Ω × C. For evaluation purposes the corpus Ω is divided into two sets:
- Training set: the sample documents from which the classifier will be induced.
- Testing set: the sample documents used to test the effectiveness of the induced classifier. This consists in comparing the decision made by the domain expert with the decision made by the classifier, using effectiveness measures that quantitatively evaluate the agreement between the two.
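The five term-weighting schemes of Section II-B can be sketched as follows (a minimal sketch; the toy documents and the guard for c = 0 in tf.rf are our own assumptions):

```python
import math

def weights(doc, docs, pos_docs, neg_docs):
    """Compute the five term-weighting schemes for every term of `doc`.

    doc      -- list of tokens for the document D_i
    docs     -- all documents in the collection (each a list of tokens)
    pos_docs -- documents in the positive category
    neg_docs -- documents in the negative category
    """
    N = len(docs)
    out = {}
    for t in set(doc):
        f = doc.count(t)                          # absolute frequency f_ij
        tf = f / len(doc)                         # eq. (3)
        n_j = sum(1 for d in docs if t in d)      # document frequency
        b = sum(1 for d in pos_docs if t in d)    # positive docs containing t
        c = sum(1 for d in neg_docs if t in d)    # negative docs containing t
        out[t] = {
            "binary": 1 if f > 0 else 0,                   # eq. (1)
            "frequency": f,                                # eq. (2)
            "tf": tf,
            "tf.idf": tf * -math.log(n_j / N),             # eq. (4)
            "tf.rf": tf * math.log(2 + b / max(1, c)),     # eq. (5), guarding c = 0
        }
    return out

# Hypothetical two-category toy collection.
pos = [["goal", "match"], ["goal", "team"]]
neg = [["market", "stock"]]
w = weights(pos[0], pos + neg, pos, neg)
```

Note that tf.rf, unlike the other four schemes, is category-dependent: the same term gets a different weight for each binary classification task.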
The most common approaches proposed in the text categorization literature are [1]:

- Decision rule classifiers: conjunctions of conditional formulas, or "clauses". Clause premises denote the presence or absence of terms in the test document, while the clause head denotes the decision on whether or not to classify it under category c. One advantage of DNF rule inducers is that they tend to generate more compact classifiers than decision tree learners. These classifiers verify the rules or clauses on the document in order to classify it; the algorithm can remove or merge clauses to make the classifier more compact without compromising its effectiveness.
- Decision tree classifiers: a decision tree (DT) classifier has a tree structure in which internal nodes are labeled by terms, branches growing out of them are labeled by tests on the weight that the term has in the test document, and leaf nodes are labeled by categories. Such a classifier categorizes a test document d by recursively testing the weights that the terms labeling the internal nodes have in the document vector, until a leaf node is reached; the label of this node is then assigned to d. Most DT classifiers use binary document representations and thus consist of binary trees.
- The Rocchio method: an adaptation of the Rocchio algorithm to text classification. It generates a prototype for each category of documents through a formula that takes into account the query, the number of relevant documents, and the number of irrelevant documents. Given a group of previously classified documents, the vector space model is applied and the prototype is generated for each class, taking the training documents of the class as positive examples and the documents of the other categories as negative examples.
- Neural networks: a neural network classifier is a network of units, where the input units usually represent terms, the output unit(s) represent the category or categories of interest, and the weights on the edges connecting units represent dependence relations. To classify a test document d, its term weights w are assigned to the input units; the activation of these units is propagated forward through the network, and the value that the output unit(s) take up as a consequence determines the categorization decision(s).
- Nearest neighbor algorithm and variants: the NN algorithm computes the similarity between the new document to be classified and every document of the previously classified training set, and assigns the new document to the category whose documents are most similar to it. One of the most common variants is k-nearest neighbors: k-NN checks whether the k training documents most similar to the new document have also been classified under category c; if the answer is positive for a sufficiently large proportion of them, a positive categorization decision is taken, otherwise a negative one.

This study focuses on Naive Bayes and Support Vector Machines.
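As an illustration, the k-NN decision rule described in the list above can be sketched in a few lines (cosine similarity over bag-of-words vectors; all training data here is hypothetical):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, training, k=3):
    """Positive decision if most of the k nearest training docs are positive."""
    bag = Counter(doc.split())
    neighbors = sorted(training,
                       key=lambda ex: cosine(bag, Counter(ex[0].split())),
                       reverse=True)[:k]
    votes = sum(1 for _, label in neighbors if label)
    return votes > k // 2

training = [("goal match team", True), ("team wins match", True),
            ("stock market falls", False), ("market rally stocks", False),
            ("goal scored late", True)]
print(knn_classify("late goal wins the match", training))  # True
```

Here the "sufficiently large proportion" is a simple majority vote; in practice the threshold, k, and the similarity measure are all tunable.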
- Naive Bayes (NB) [10]: consists in estimating, for each class, the probability that an object of that class takes each possible discrete value of the variable vector X, and then applying Bayes' theorem to produce a classification. The number of probabilities to be estimated is of order O(k^p) for p k-valued variables, which becomes impractical as p grows. Appropriate independence assumptions allow the full conditional distribution, which requires O(k^p) probabilities, to be approximated by a product of univariate distributions, requiring only O(kp) probabilities per class. For m classes c_k, with 1 ≤ k ≤ m, Naive Bayes is defined as:

  p(x | c_k) = p(x_1, ..., x_p | c_k) = ∏_{j=1}^{p} p(x_j | c_k)   (6)
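A minimal sketch of the Naive Bayes decision of Eq. (6), using word occurrences as features (the Laplace smoothing constant and the toy training data are our own assumptions, not from the study):

```python
from collections import Counter
from math import log

def train_nb(docs):
    """Estimate per-class word log-probabilities p(x_j | c_k) with Laplace smoothing."""
    vocab = {t for text, _ in docs for t in text.split()}
    model = {}
    for label in {lab for _, lab in docs}:
        counts = Counter(t for text, lab in docs if lab == label
                         for t in text.split())
        total = sum(counts.values())
        prior = sum(1 for _, lab in docs if lab == label) / len(docs)
        model[label] = (log(prior),
                        {t: log((counts[t] + 1) / (total + len(vocab)))
                         for t in vocab})
    return model

def classify_nb(model, text):
    """Pick the class maximizing log p(c_k) + sum_j log p(x_j | c_k), per Eq. (6)."""
    scores = {}
    for label, (log_prior, log_probs) in model.items():
        scores[label] = log_prior + sum(log_probs.get(t, 0.0)
                                        for t in text.split())
    return max(scores, key=scores.get)

docs = [("goal match team", "sport"), ("team wins goal", "sport"),
        ("stock market falls", "business"), ("market rally stock", "business")]
model = train_nb(docs)
print(classify_nb(model, "goal team"))  # sport
```

Working in log-space turns the product of Eq. (6) into a sum, which avoids numerical underflow when p is large.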
- Support Vector Machines (SVM) [5]: based on the Structural Risk Minimization (SRM) principle from computational learning theory. The idea of SRM is to find a hypothesis h for which the lowest true error can be guaranteed, where the true error of h is the probability that h makes an error on an unseen, randomly selected test example; an upper bound connects the true error of a hypothesis h with its error on the training set. In geometrical terms, the method finds the surface in N-dimensional space that separates the positive from the negative training examples by the widest possible margin. For example, when the positive and negative examples are linearly separable, the decision surfaces are (N−1)-hyperplanes, and many of them could be chosen as the decision surface; the SVM method chooses the middle element of the widest set of parallel hyperplanes, so the best decision surface is determined by only a small set of training examples, called support vectors. When the examples are not linearly separable, kernel functions are used to map the problem into a higher-dimensional space. Computing this hyperplane is equivalent to solving the following optimization problem:

  minimize: V(w, b, ξ) = (1/2) w · w + C ∑_{i=1}^{n} ξ_i   (7)

  subject to: ∀ i = 1, ..., n: y_i [w · x_i + b] ≥ 1 − ξ_i;  ξ_i ≥ 0   (8)

  The constraints require that all training examples be classified correctly up to some slack ξ_i. If a training example lies on the "wrong" side of the hyperplane, the corresponding ξ_i is greater than or equal to 1; therefore ∑_{i=1}^{n} ξ_i is an upper bound on the number of training errors. The factor C is a parameter that trades off training error against model complexity.

D. Evaluation

This section reviews the performance measures most commonly used in the TC literature for evaluating text classifiers. Estimators for these measures can be defined from a contingency table of predictions on an independent test set, given by the four possible outcomes of a prediction: TP (true positive), TN (true negative), FP (false positive), and FN (false negative).

Table I
CONTINGENCY TABLE.

                          expert judgments
                          YES    NO
classifier judgments YES  TP     FP
                     NO   FN     TN

The diagonal cells count how often the prediction was correct; the off-diagonal cells, how often it was wrong. The measures considered in this work are precision, recall, and F_1.

- Precision and recall: classification effectiveness is most often measured in terms of the classic Information Retrieval notions of precision (π) and recall (ρ). Borrowing terminology from logic, π may be seen as the "degree of soundness" of the classifier, and ρ as its "degree of completeness". In terms of the contingency table of Table I, these notions are defined as:

  π = TP / (TP + FP)   (9)

  ρ = TP / (TP + FN)   (10)

- F_β measure: to obtain a single performance measure, the weighted harmonic mean of precision (π) and recall (ρ), called F_β, is commonly used [11]:

  F_β = (1 + β²) · πρ / (β²π + ρ)   (11)

  β is a parameter; the most commonly used value is β = 1, which gives equal weight to precision and recall. F_1 can thus be computed as:

  F_1 = 2 × (π × ρ) / (π + ρ)   (12)

III. EXPERIMENTAL SETUP

This study evaluates five representations for text documents:

- Binary
- Frequency
- tf
- tf.idf
- tf.rf

The LibSVM tool [12] was used for SVM, and an implementation of NB with an m-estimator [3] was used for classification. The initial corpus of Spanish documents was obtained from the Chilean newspaper La Tercera (http://www.latercera.com): 1000 articles were chosen from the "Sport", "World", "Business" and "Chile" categories, of which 2/3 were selected for training and 1/3 for testing. The initial corpus of English documents was obtained from the ModApte split of the Reuters-21578 dataset compiled by David Lewis [13]; the same numbers of training and test examples were chosen for the "Acq" category (acquisition-related news) and for non-acq-related categories. Five data sets, denoted Group1, Group2, . . . , Group5, were built by randomly selecting articles for the training and testing sets. Tables II and III present the characteristics of these data sets.
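The evaluation measures of Eqs. (9), (10) and (12) can be computed directly from prediction/label pairs (a minimal sketch on hypothetical binary decisions, not data from the study):

```python
def prf1(y_true, y_pred):
    """Precision, recall and F1 from the contingency table of Table I."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0       # eq. (9)
    recall = tp / (tp + fn) if tp + fn else 0.0          # eq. (10)
    f1 = (2 * precision * recall / (precision + recall)  # eq. (12)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical binary decisions for 8 test documents.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
p, r, f1 = prf1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.75 0.75 0.75
```

The guards for empty denominators matter in practice: a classifier that never predicts the positive class would otherwise divide by zero when computing precision.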
Table II
CHARACTERISTICS OF REUTERS PRE-PROCESSED DATA SET.

Dataset   Number of Documents   Vocabulary Size
Group1    1000                  7123
Group2    1000                  6747
Group3    1000                  6831
Group4    1000                  7006
Group5    1000                  6744

Table III
CHARACTERISTICS OF LA TERCERA PRE-PROCESSED DATA SET.

Dataset   Number of Documents   Vocabulary Size
Group1    1000                  14436
Group2    1000                  12985
Group3    1000                  14311
Group4    1000                  14555
Group5    1000                  14352
IV. EXPERIMENTAL RESULTS

The F_1 results of SVM and Naive Bayes for each representation are listed in Tables IV and V. In addition, Figure 1 plots these measures.

Figure 1. Different representations and their performance in terms of F_1. [figure omitted]

Figure 2. Boxplot of F_1 for La Tercera using SVM. [figure omitted]

Figure 3. Boxplot of F_1 for La Tercera using NB. [figure omitted]

Table IV
F_1 AVERAGE FOR REUTERS USING SVM AND NB.

            SVM                   NB
Binary      0.9729 ± 0.01427%     0.9041 ± 0.02162%
Frequency   0.9764 ± 0.01025%     0.9120 ± 0.02162%
tf          0.9770 ± 0.00928%     0.9041 ± 0.02162%
tf.idf      0.9570 ± 0.01177%     0.9041 ± 0.02162%
tf.rf       0.9703 ± 0.00899%     0.9041 ± 0.02162%

Table V
F_1 AVERAGE FOR LA TERCERA USING SVM AND NB.

            SVM                   NB
Binary      0.9792 ± 0.01052%     0.8067 ± 0.03847%
Frequency   0.9830 ± 0.00466%     0.8067 ± 0.03828%
tf          0.9819 ± 0.00426%     0.8021 ± 0.03316%
tf.idf      0.9815 ± 0.00246%     0.8067 ± 0.03828%
tf.rf       0.9746 ± 0.00946%     0.8100 ± 0.03828%

For the La Tercera data set, Figures 2 and 3 show similar performance in terms of F_1 for each classifier. There are some differences in performance using NB, but they are not statistically significant.

For the Reuters data set, Figure 5 shows identical performance across the different representations using NB, with the exception of Frequency, which presents a lower F_1, without statistical significance. Figure 4 shows differing performance in terms of F_1 among the representations, also with no statistically significant differences.
Figure 4. Boxplot of F_1 for Reuters using SVM. [figure omitted]

Figure 5. Boxplot of F_1 for Reuters using NB. [figure omitted]

V. CONCLUSIONS

Automated text classification is an important and developing field of Information Retrieval and Machine Learning, and many approaches have been tested on well-known benchmarks. In this paper we tested five term-weighting schemes on Spanish and English data sets, using Support Vector Machine and Naive Bayes classifiers. Results show that the performances for each language are closely similar, and that there are no statistically significant differences between the term-weighting schemes for binary classification. We believe that the contribution of this work is the inclusion of Spanish documents, aimed at showing that the same performance reported in the TC literature for classifying English documents can be achieved. For future work, we plan to compare representations that include feature selection techniques, in order to improve classification performance.

REFERENCES

[1] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002.
[2] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. Kluwer/Springer, 2002.
[3] T. Mitchell, Machine Learning, 1st ed. McGraw-Hill, 1997.
[4] M. Keikha, N. Razavian, F. Oroumchian, and H. S. Razi, "Document representation and quality of text: An analysis," in Survey of Text Mining II: Clustering, Classification, and Retrieval. London: Springer-Verlag, 2008, pp. 135–168.
[5] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Machine Learning: ECML-98, ser. Lecture Notes in Computer Science, vol. 1398, C. Nédellec and C. Rouveirol, Eds. Berlin/Heidelberg: Springer-Verlag, 1998, pp. 137–142.
[6] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.
[7] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[8] E. Leopold and J. Kindermann, "Text categorization with support vector machines. How to represent texts in input space?" Machine Learning, vol. 46, no. 1–3, pp. 423–444, 2002.
[9] M. Lan, C.-L. Tan, and H.-B. Low, "Proposing a new term weighting scheme for text categorization," in AAAI'06: Proceedings of the 21st National Conference on Artificial Intelligence. AAAI Press, 2006, pp. 763–768.
[10] D. J. Hand, P. Smyth, and H. Mannila, Principles of Data Mining. Cambridge, MA: MIT Press, 2001.
[11] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, no. 1–2, pp. 69–90, 1999.
[12] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[13] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," Journal of Machine Learning Research, vol. 5, pp. 361–397, 2004.