Text Document Preprocessing with the Bayes Formula for Classification Using the Support Vector Machine

Dino Isa, Lam Hong Lee, V.P. Kallimani, and R. RajKumar

Abstract—This work implements an enhanced hybrid classification method through the utilization of the naïve Bayes approach and the Support Vector Machine (SVM). In this project, the Bayes formula was used to vectorize (as opposed to classify) a document according to a probability distribution reflecting the probable categories that the document may belong to. The Bayes formula gives a range of probabilities to which the document can be assigned according to a predetermined set of topics (categories) such as those found in the "20 Newsgroups" data set, for instance. Using this probability distribution as the vector representing the document, the SVM can then be used to classify the documents on a multidimensional level. The inadvertent dimensionality reduction caused by classifying with only the highest probability in the naïve Bayes classifier is overcome by the SVM, which employs all the probability values associated with every category for each document. This method can be used for any data set and shows a significant reduction in training time as compared to the Lsquare method and a significant improvement in classification accuracy when compared to pure naïve Bayes systems and to TF-IDF/SVM hybrids.

Index Terms—Document classification, Bayes formula, support vector machines.
1 INTRODUCTION

Document classification can be defined as the task of automatically categorizing collections of electronic documents into their annotated classes based on their contents. In recent years, this has become important due to the advent of large amounts of data in digital form. For several decades now, document classification in the form of text classification systems has been widely implemented in numerous applications such as spam filtering [2], [3], [4], [6], [22], e-mail categorization [1], [10], directory maintenance, and ontology mapping [11]. An increasing number of statistical and computational approaches have been developed for document classification, including k-nearest-neighbor (k-NN) classification [13], naïve Bayes classification [5], [18], support vector machines (SVMs) [14], maximum entropy [9], decision tree induction, rule induction, and artificial neural networks [19]. Each of the document classification schemes previously mentioned has its own unique properties and associated problems. The decision tree induction algorithm and the rule induction algorithm are simple to understand and interpret. However, these algorithms do not work well when the number of distinguishing features between documents is large [12]. The k-NN algorithm is easy to implement and shows its effectiveness in a variety of
problem domains [13]. A major drawback of the k-NN algorithm is that it is computationally intensive, especially when the size of the training set grows large [13]. SVMs can be used as discriminative document classifiers and have been shown to be more accurate than most other techniques for classification tasks [14], [16]. The main problem associated with using the SVM for document classification is the effort needed to transform text data to numerical data. We call this essential step the "vectorization" step. Techniques such as TF-IDF [30] vectorize the data easily; however, dimensionality then becomes an issue, since each vectorized word is used to represent the document. This leads to the number of dimensions being equal to the number of words. For linear classifiers based on the principle of Empirical Risk Minimization, capacity is equal to data dimensionality. For these learning machines, high dimensionality leads to "overfitting," and the hypothesis becomes too complicated to implement computationally. For the SVM, on the other hand, overfitting does not occur, since its capacity is determined by the margin of separation between the support vectors (SVs) and the optimal hyperplane instead of by the dimensionality of the data. What is of concern, however, is the high training time and classification time associated with high-dimension data (Fig. 9). In this work, we reduce the dimensions from thousands (equal to the number of words in the document, such as when using TF-IDF) to typically less than 20 (the number of categories the document may be classified to) through the use of the Bayes formula and then feed this probability distribution to the SVM for training and classification purposes. With this, we hope to reduce training and classification time to a feasible level (reducing dimensions from thousands to 10 or 20) while maintaining acceptable generalization (using the optimal hyperplane) and accuracy (10 or 20 dimensions used for classification instead of 1, as
is the case with the naïve Bayes classifier). This can be seen in Fig. 9. The good generalization characteristic of the SVM is due to its implementation of structural risk minimization, which entails finding an optimal hyperplane, thus guaranteeing the lowest classification error. This, plus a capacity that is independent of the dimensionality of the feature space, makes the SVM a highly accurate classifier in most applications [14]. However, the usage of the SVM in many real-world applications is relatively complex and time consuming due to its convoluted training and categorizing algorithms (requiring the computation of inner products between SVs and data points). By itself, the naïve Bayes is a relatively accurate classifier if trained using a large data set. However, as in many other linear classifiers, capacity control and generalization remain an issue. In our case, however, the naïve Bayes is used as a preprocessor in the front end of the SVM to vectorize text documents before the classification is carried out (by the SVM). This is done to improve the generalization of the overall system while still maintaining a comparatively feasible training and categorization time afforded through the dimensionality reduction imposed by the Bayes formula. In order to compare the improvements afforded by using the naïve Bayes and SVM in a hybrid system, we present results that compare the classification accuracy of using only the naïve Bayes against using the hybrid system (naïve Bayes vectorizer and SVM classifier). We also compare results obtained using TF-IDF [30] as the vectorizer and using the Lsquare method as a classifier [29]. In the context of document classification, the Bayes theorem uses the fact that the probability of a particular document being annotated to a particular category, given that the document contains certain words in it, is equal to the probability of finding those certain words in that particular category, times the probability that any document is annotated to that category, divided by the probability of finding those words in any document, as illustrated in (1):

\Pr(\mathrm{Category} \mid \mathrm{Word}) = \frac{\Pr(\mathrm{Word} \mid \mathrm{Category}) \cdot \Pr(\mathrm{Category})}{\Pr(\mathrm{Word})}.   (1)
Each document contains words that are given probability values based on the number of their occurrences within that particular document. Naïve Bayes classification is predicated on the idea that electronic documents can be classified based on the probability that certain keywords will correctly identify a piece of text to its annotated category. At the basic level, a naïve Bayes classifier examines a set of text documents that have been well organized and categorized and then compares the content of all categories in order to build a database of words and their occurrences. This database is used to identify or predict the membership of future documents to their right category, according to the probability of certain words occurring more frequently for certain categories. It overcomes the obstacles faced by more static technologies, such as blacklist checking and word-to-word comparison using static databases filled with predefined keywords.
In other words, the overall classifier is robust enough to ignore serious deficiencies in its underlying naïve probability model [23], i.e., the assumption of the independence of probable events. This robustness is encapsulated in the relative magnitudes of the probabilities (relatively constant for the same category), which is important for the SVM, as opposed to an accurate "absolute" value of probability for each category. In this paper, we emphasize the hybrid approach that utilizes the simplicity of the Bayes formula as a vectorizer and the good generalization capability of the SVM as a classifier.
2 THE HYBRID CLASSIFICATION APPROACH
2.1 The Hybrid Model

We propose, design, implement, and evaluate a hybrid classification method by integrating the naïve Bayes vectorizer and the SVM classifier to take advantage of the simplicity of the Bayes technique and the accuracy of the SVM approach. This hybrid approach shows improved classification performance as compared to the traditional naïve Bayes performing by itself, and it overcomes the major drawback of using the traditional naïve Bayes technique alone as a classifier [18]. The SVM is unable to handle data in text format. An electronic text document may contain a huge number of features such as keywords, which are very useful in terms of the classification task. Therefore, to classify text documents by using the SVM classifier, the keywords need to be transformed into vectors, which are suitable as input to the SVM. Generally, the words of a text document are directly vectorized, thereby transforming the text documents into a numerical format. With TF-IDF, for example, the huge number of features transformed from the text data of a single document makes running the SVM a long affair, as it takes time to process vectorized data with large dimensions. Many methods have been devised to overcome this problem by reducing the dimension of the features. In order to reduce the number of features, we introduce the naïve Bayes as the vectorizer for the text documents by using the probability distribution described earlier, where the dimension of the features is based on the number of available categories in the classification task. Since the naïve Bayes classifier is able to handle raw text data via the probability distribution of keyword occurrence, and the SVM typically requires preprocessing to vectorize the raw text documents into numerical values, it is natural to use the naïve Bayes as the vectorizer and the SVM as the classifier. The structure of the proposed classification approach is illustrated in Fig. 1. For the purposes of using the SVM as a classifier, after the naïve Bayes vectorizer has been trained, each training document is vectorized by the trained naïve Bayes classifier through the calculation of the posterior probability of the training document for each existing category based on the Bayes formula. For example, the probability value for a document X to be annotated to a category C is computed as Pr(C|X). Assuming that we have a category list Cat_1, Cat_2, Cat_3, Cat_4, Cat_5, ..., Cat_N,
Fig. 1. Proposed hybrid classification approach block diagram.
thus, each document has N associated probability values, where document X will have Pr(Cat_1|X), Pr(Cat_2|X), Pr(Cat_3|X), Pr(Cat_4|X), Pr(Cat_5|X), ..., Pr(Cat_N|X). All the probability values of a document are combined to construct a multidimensional array, which represents the probability distribution of the document in the feature space. In this way, all the training documents are vectorized by their probability distributions in feature space, in the format of numerical multidimensional arrays, with the number of dimensions depending on the number of categories. With this transformation, the training documents are suitable for use in constructing the vectorized training data set for the SVM. Our approach uses the same training data set for both the naïve Bayes vectorizer and the SVM classifier, where the naïve Bayes vectorizer uses the raw text documents for training purposes, and the SVM classifier uses the vectorized training data supplied by the naïve Bayes vectorizer. The data set was divided into two subsets, where the smaller one was used for training, and the larger one was used for testing. Results from cross validation show better performance than the approach above, as expected, since the training set in cross validation was larger than that used for generating the presented results (Fig. 10). As for the classification session, the input to the trained naïve Bayes classifier is now replaced by the unknown text document. The output from the naïve Bayes vectorizer, which is the vectorized form of the input text document, in the format of multidimensional numerical probability values, is used as the input to the SVM for the final classification steps.
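To make the pipeline in Fig. 1 concrete, the following is a minimal illustrative sketch, assuming the scikit-learn library; its MultinomialNB stands in for the Bayes-formula vectorizer of Section 2.2 (so the exact probability values differ), and the function names are ours for illustration only. The probability distribution produced by the Bayes model becomes the low-dimensional input vector for the SVM.

# Hybrid pipeline sketch: naive Bayes posteriors as document vectors,
# SVM as the final classifier (scikit-learn assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def train_hybrid(train_texts, train_labels):
    counter = CountVectorizer()                      # raw text -> word counts
    counts = counter.fit_transform(train_texts)
    nb = MultinomialNB().fit(counts, train_labels)   # Bayes model over the counts
    prob_vectors = nb.predict_proba(counts)          # N values per document (N = categories)
    svm = SVC(kernel="rbf").fit(prob_vectors, train_labels)  # SVM trained on N dimensions
    return counter, nb, svm

def classify_hybrid(counter, nb, svm, texts):
    probs = nb.predict_proba(counter.transform(texts))  # vectorize unknown documents
    return svm.predict(probs)                            # final classification by the SVM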
2.2 The Naïve Bayes Classification Approach for Vectorization—The Bayes Formula

Our proposed naïve Bayes classifier [20] performs its classification tasks starting with the initial step of analyzing the text document by extracting the words that it contains. To perform this analysis, a simple word extraction algorithm is used to extract each individual word from the document and generate a list of words. The list of words is constructed with the assumption that the input document contains words
Fig. 2. Table of word occurrences and probabilities.
w_1, w_2, w_3, ..., w_{n-1}, w_n, where the length of the document (in terms of the number of words) is n. The list of words is then used to generate a table containing the probabilities of each word in each category. The "Word" column is filled with the words that are extracted from the input document. For the columns of probabilities of a particular word for each category, the values are calculated by the probabilistic classifier in the following stage. Fig. 2 illustrates the table used for an input document. Before the probabilistic classifier performs the calculation of word probabilities for each category, it needs to be trained with a well-categorized training data set. Each individual word from all training documents in the same category is extracted and recorded in a list of word occurrences for the particular category by using a simple algorithm. Based on the list of word occurrences, the trained probabilistic classifier calculates the posterior probability of a particular word of the document being annotated to a particular category by using the formula shown in (2), since each word in the input document contributes to the document's categorical probability:

\Pr(\mathrm{Category} \mid \mathrm{Word}) = \frac{\Pr(\mathrm{Word} \mid \mathrm{Category}) \cdot \Pr(\mathrm{Category})}{\Pr(\mathrm{Word})}.   (2)
The derived equation above shows that by observing the value of a particular word, w_j, the prior probability of a particular category C_i, Pr(C_i), can be converted into the posterior probability Pr(C_i|w_j), which represents the probability of a particular word w_j belonging to a particular category C_i. The prior probability Pr(C_i) can be computed from (3) or (4):

\Pr(C_i) = \frac{\text{total words in } C_i}{\text{total words in the training data set}}   (3)

= \frac{\text{size of } C_i}{\text{size of the training data set}}.   (4)
Meanwhile, the evidence, which we call the normalizing constant of a particular word w_j, Pr(w_j), is calculated by using (5):

\Pr(w_j) = \frac{\sum \text{occurrences of } w_j \text{ in all categories}}{\sum \text{occurrences of all words in all categories}}.   (5)

The total occurrence of a particular word in every category can be calculated by searching the training database, which is composed of the lists of word occurrences for every category. As previously mentioned, the list of word occurrences for a category is generated from the analysis of all training documents in that particular category during the initial training stage. The same method can be used to retrieve the sum of the occurrences of all words in every category in the training database. To calculate the likelihood of a particular category C_i with respect to a particular word w_j, the lists of word occurrences from the training database are searched to retrieve the occurrences of w_j in C_i and the sum of all words in C_i. This information contributes to the value of Pr(w_j | C_i) given in (6):

\Pr(w_j \mid C_i) = \frac{\text{occurrences of } w_j \text{ in } C_i}{\sum \text{occurrences of all words in } C_i}.   (6)
Based on the derived Bayes formula for text classification and the values of the prior probability Pr(Category), the likelihood Pr(Word|Category), and the evidence Pr(Word), the posterior probability Pr(Category|Word) of each word in the input document being annotated to each category can be measured. The posterior probability of each word being annotated to a category is then filled in at the appropriate cells in the table, as illustrated in Fig. 2. After all the "Probability" cells have been filled, the overall probability for an input document to be annotated to a particular category C_i is calculated by dividing the sum of each "Probability" column by the length of the query, n, as shown in (7):

\Pr(C_i \mid \mathrm{Document}) = \frac{\Pr(C_i \mid w_1, w_2, w_3, \ldots, w_{n-1}, w_n)}{n},   (7)
where w_1, w_2, w_3, ..., w_{n-1}, w_n are the words that are extracted from the input document. The ordinary naïve Bayes classifier is able to determine the right category of an input document by referring to the highest probability value calculated by the trained classifier based on the Bayes formula. The right category is represented by the category that has the highest posterior probability value, Pr(Category|Document), as stated in the Bayes classification rule. As the ordinary naïve Bayes classification algorithm has been shown to be one of the poorer performing classifiers, we have extended the classification step by using the SVM to increase classification accuracy. However, we still use the Bayes formula, but in our case it is used to vectorize the text data before classification by the SVM.
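The calculation chain (2)-(7) can be summarized in a short plain-Python sketch; the function names (train_counts, bayes_vectorize) are ours for illustration, and smoothing of unseen words is omitted for brevity.

from collections import Counter

def train_counts(training_docs):
    # training_docs: dict mapping category -> list of token lists (one list per document).
    word_counts = {c: Counter() for c in training_docs}
    for category, docs in training_docs.items():
        for tokens in docs:
            word_counts[category].update(tokens)      # list of word occurrences per category
    return word_counts

def bayes_vectorize(tokens, word_counts):
    # Returns {category: Pr(category | document)} following (2)-(7).
    categories = list(word_counts)
    words_in = {c: sum(word_counts[c].values()) for c in categories}
    total_all = sum(words_in.values())
    prior = {c: words_in[c] / total_all for c in categories}               # (3)
    scores = dict.fromkeys(categories, 0.0)
    for w in tokens:
        evidence = sum(word_counts[c][w] for c in categories) / total_all  # (5)
        if evidence == 0:
            continue                                   # word unseen in training: skip
        for c in categories:
            likelihood = word_counts[c][w] / words_in[c]                   # (6)
            scores[c] += likelihood * prior[c] / evidence                  # (2), per word
    n = max(len(tokens), 1)
    return {c: scores[c] / n for c in categories}                          # (7)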
2.3 The Support Vector Machines Classification Approach

The classification problem here can be restricted to the consideration of the two-class problem without loss of
Fig. 3. Optimal separating hyperplane.
generality. Multiclass problems can be addressed using multiple binary classifications, a technique known as one-against-all [21]. Therefore, an understanding of multiclass classification hinges on an understanding of binary classification, which we present in detail here. The ultimate goal is to produce a classifier that will work well on unseen examples, i.e., one that generalizes well. Consider the example in Fig. 3. Here, there are many possible linear classifiers that can separate the data, but there is only one that maximizes the margin (maximizes the distance between it and the nearest data point of each class). This linear classifier is termed the optimal separating hyperplane. We will start with the simplest case: linear machines trained on linearly separable data. Therefore, consider the problem of separating a set of training vectors x_i \in \mathbb{R}^d belonging to different classes y_i \in \{-1, 1\}. We wish to separate this training set with a hyperplane [23]:

w \cdot x + b = 0.   (8)
There are actually an infinite number of hyperplanes that are able to partition the data into two sets (dashed lines in Fig. 3). According to the SVM methodology, there will be just one optimal hyperplane: the hyperplane lying halfway within the maximal margin (we define the margin as the sum of the distances of the hyperplane to the closest training points of each class). The solid line in Fig. 3 represents this optimal separating hyperplane; the margin in this case is d_1 + d_2 [24]. Note that only the closest points of each class determine the Optimal Separating Hyperplane. These points are called SVs. As only the SVs determine the Optimal Separating Hyperplane, there is a certain way to represent them for a given set of training points. It has been shown [23] that the maximal margin can be found by minimizing \tfrac{1}{2}\|w\|^2:

\min \left\{ \tfrac{1}{2}\|w\|^2 \right\}.   (9)

The Optimal Separating Hyperplane can thus be found by minimizing (9) under the constraint (10) that the training data are correctly separated:

y_i (x_i \cdot w + b) \ge 1, \quad \forall i.   (10)
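The step from "maximize the margin" to the objective in (9) is standard; for completeness, a brief sketch of the usual argument under the canonical normalization of w and b:

% For the canonical hyperplane, the closest points of each class satisfy
% y_i (w \cdot x_i + b) = 1. The distance from any point x to the hyperplane
% w \cdot x + b = 0 is |w \cdot x + b| / \|w\|, so each closest point lies at
% distance 1/\|w\| and the total margin is
\[
  d_1 + d_2 \;=\; \frac{1}{\|w\|} + \frac{1}{\|w\|} \;=\; \frac{2}{\|w\|}.
\]
% Maximizing 2/\|w\| is equivalent to minimizing \|w\|, which gives the
% differentiable objective \min \tfrac{1}{2}\|w\|^2 of (9), subject to (10).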
The concept of the Optimal Separating Hyperplane can be generalized for the nonseparable case by introducing a cost for violating the separation constraints (10). This can be
Fig. 4. Mapping onto higher dimensional feature space [26].
done by introducing positive slack variables \xi_i in constraints (10), which then become [24]:

y_i (x_i \cdot w + b) \ge 1 - \xi_i, \quad \forall i.   (11)
If an error occurs, the corresponding \xi_i must exceed unity, so \sum_i \xi_i is an upper bound on the number of classification errors. Hence, a logical way to assign an extra cost for errors is to change the objective function (9) to be minimized to

\min \left\{ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \right\},   (12)

where C is a chosen parameter. A larger C corresponds to assigning a higher penalty to classification errors. Minimizing (12) under constraint (11) gives the Generalized Optimal Separating Hyperplane. This is a Quadratic Programming (QP) problem, which can be solved here using the method of Lagrange multipliers. After performing the required calculations [23], the QP problem can be solved by finding the Lagrange multipliers \alpha_i that maximize the objective function in (13):

W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j,   (13)
subject to the constraints

0 \le \alpha_i \le C, \quad i = 1, \ldots, n, \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.
The new objective function is in terms of the Lagrange multipliers \alpha_i only. It is known as the dual problem: if we know w, we know all \alpha_i; if we know all \alpha_i, we know w. Many of the \alpha_i are zero, and so w is a linear combination of a small number of data points. The x_i with nonzero \alpha_i are called the SVs [25]. Let t_j (j = 1, \ldots, s) be the indices of the s SVs. We can write

w = \sum_{j=1}^{s} \alpha_{t_j} y_{t_j} x_{t_j}.   (14)
By introducing the kernel function

K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle,   (15)

it is not necessary to explicitly know \phi(\cdot) [27], since the optimization problem (13) can be translated directly into the more general kernel version [26] (see Fig. 4):

W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),   (16)

subject to C \ge \alpha_i \ge 0 and \sum_{i=1}^{n} \alpha_i y_i = 0.

The SVM uses only some of these documents (represented by feature vectors) for training and retains a small proportion of the entire set as its SVs. This makes the SVM computationally more efficient, needing fewer memory resources as compared to other techniques like neural networks, which need relatively larger training data sets in order to generalize accurately. This good generalization performance of the SVM is one of the main reasons for using it in text classification. Although millions of documents may be available for training, the burden will be the high training time. Since the SVM needs a comparatively small training set to isolate the SVs that form the margins of the optimal hyperplane, training time is reduced compared to other margin classifiers. However, the higher the data dimensionality, the longer the SVM training time, and we have addressed this by using the Bayes formula as a vectorizer. As mentioned previously, another factor that sets the SVM apart is that, through its implementation of Structural Risk Minimization, capacity is independent of data dimensionality. The time to train the SVM is, however, still proportional to data dimensionality. In our case, the vectorization process reduces the dimensions of the data to the number of available categories. In vectorization techniques such as TF-IDF, by contrast, the number of dimensions is equal to the number of words in the document. As will be shown later, this sacrifice does not detrimentally affect classification accuracy but buys us a further improved SVM training time, which can be significant for large documents. Using this method, we find the middle ground between the single-dimension classification technique afforded by the naïve Bayes classifier (the SVM is multidimensional) and the high training time needed for techniques like TF-IDF/SVM hybrids (we reduce data dimensionality and, therefore, the training and classification time as well).
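As a small illustrative sketch (assuming scikit-learn), the decision function of a trained SVM can be evaluated directly from its retained SVs and dual coefficients \alpha_i y_i, as in (14) and (16); here the inputs are the N-dimensional probability vectors from the naïve Bayes vectorizer, and the toy numbers below are hypothetical.

import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2), an RBF instance of the kernel in (15).
    return np.exp(-gamma * np.sum((x - z) ** 2))

def manual_decision(svm, x, gamma):
    # sum_j (alpha_j * y_j) K(sv_j, x) + b, using only the stored SVs, as in (14)/(16).
    k = np.array([rbf_kernel(sv, x, gamma) for sv in svm.support_vectors_])
    return float(svm.dual_coef_[0] @ k + svm.intercept_[0])

# Toy 3-category probability vectors (two classes of documents).
X = np.array([[0.80, 0.10, 0.10], [0.70, 0.20, 0.10],
              [0.10, 0.80, 0.10], [0.20, 0.70, 0.10]])
y = np.array([0, 0, 1, 1])
svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
x_new = np.array([0.75, 0.15, 0.10])
print(manual_decision(svm, x_new, gamma=0.5))   # should match the library's value below
print(svm.decision_function([x_new])[0])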
3 EXPERIMENTAL RESULTS
The objective of this evaluation is to determine whether the proposed hybrid approach (Bayes vectorizer and SVM classifier) results in better classification accuracy and performance as compared to the ordinary naïve Bayes classifier, the TF-IDF hybrid (TF-IDF vectorizer and SVM classifier), and the Lsquare method [29]. Initially, we performed the experiment by implementing the naïve Bayes algorithm with different ranking schemes [20]: the naïve Bayes with flat ranking algorithm, the naïve Bayes with round robin tournament ranking algorithm, the naïve Bayes with single elimination tournament ranking algorithm, and the naïve Bayes with High Relevance Keywords Extraction (HRKE) algorithm. These classification algorithms are discussed briefly below [20]. In particular, the naïve Bayes with flat ranking computes the probability distribution by considering all categories in a single round of competition. The round robin version, on the other hand, calculates the probability distribution by considering two categories at a
Fig. 5. Comparison table for performances of Pure Naïve Bayes Algorithms and Hybrid Algorithms (Data set: Vehicles—Wikipedia).
time iteratively until each category has competed with all the other categories, and the winner is determined by the highest winning score. The single elimination method entails finding a winner that has not lost even once within the competition. The HRKE algorithm culls out words such as "a," "the," etc., which have little effect on the classification task because they appear in almost every document. The algorithms mentioned above determine the right category for input documents by referring to the associated probability values calculated by the trained classifier using the Bayes formula. The right category is represented by the category that has the highest posterior probability value, Pr(Category|Document). To evaluate the hybrid classification approach that is proposed in this paper, we implement the naïve Bayes classification algorithms mentioned above at the front end to vectorize the raw text data using the associated probability values calculated with the Bayes formula. The SVM is then used to perform the classification task using the Linear, RBF (Radial Basis Function), Sigmoid, and Polynomial kernels. Here, we present the classification results based on the worst-case scenario, where the data sets have been separated into two subsets of different sizes, and the smaller subset for each data set is used as the training set, while the bigger one is used for testing purposes. The results here were seen to be worse than those obtained using cross validation (see Fig. 10). For cross validation, we simply swap the training set and the testing set. Cross validation was used on the 20 Newsgroups data set, and the classifier that is trained with the bigger subset shows better accuracy as compared to the classifier that is trained with the smaller subset, as illustrated in Fig. 10. A data set of vehicle characteristics extracted from Wikipedia was tested in the prototype system for the evaluation of classification performance in handling a database with four categories that have low degrees of similarity. Our selected data set contains four categories of vehicles: Aircraft, Boats, Cars, and Trains. All four categories are easily differentiated, and every category has a set of unique keywords. We have collected 110 documents for each category, with a total of 440 documents in the entire data set. Fifty documents from each category were extracted randomly to
Fig. 6. Comparison table for performances of Pure Naïve Bayes Algorithms and Hybrid Algorithms (Data set: Automobiles—Wikipedia).
build the training data set for the classifier. The other 60 documents for each category were used to form the testing data set. Fig. 5 illustrates the comparison of performance between the pure naïve Bayes classification algorithms mentioned above and the hybrid algorithms performed in conjunction with SVM classification at the back end when classifying the Vehicles data set. The tournament-structure-based ranking algorithms show poor performance in terms of classification results when implemented with the pure naïve Bayes classifier and applied to the Vehicles data set. With the proposed hybrid approach, however, a significant improvement in classification accuracy is seen, since a 25 percent gain is achieved with the Bayes model implemented with the round robin tournament ranking technique, and a 20.84 percent gain is achieved with the hybrid approach compared with the pure naïve Bayes model, where both the hybrid and ordinary approaches were implemented with the single elimination tournament ranking technique. We also tested our proposed hybrid approach by using another data set of automobile characteristics extracted from Wikipedia, which is a data set with a relatively high number of categories and a high degree of similarity between classes, compared to the Vehicles data set in the previous experiment. This data set contains nine categories of automobiles, differentiated in terms of region and type of body: American Mini Vans, American Sports Cars, American SUVs, Asian Mini Vans, Asian Sports Cars, Asian SUVs, European Mini Vans, European Sports Cars, and European SUVs. All the categories have a relatively high degree of similarity. However, every category has a set of unique keywords that differentiate the characteristics of each document. We have collected 30 documents for each category, with a total of 270 documents in the entire data set. Ten documents from each category were extracted randomly to build the training data set, while the other 20 documents from each category were extracted for the purpose of testing the performance of the classifier. Fig. 6 illustrates the comparison of performance between the pure naïve
Fig. 8. Classification accuracy of Lsquare, SVM, naïve Bayes with HRKE facility, and NB-SVM hybrid on six categories of the 20 Newsgroups data set.
Fig. 7. Comparison table for performances of Pure Naïve Bayes Algorithms and Hybrid Algorithms (Data set: Mathematics—Arxiv.org).
Bayes classifier and our proposed hybrid classification algorithm for the Automobiles data set. The comparison table illustrated in Fig. 6 again confirms our prediction that the hybrid classification approach has increased performance compared to the ordinary naïve Bayes classifier. The hybrid approach also shows better performance as compared to the naïve Bayes classifier implemented with tournament-structure-based ranking methods. Similar to the experiment using the Vehicles data set, a significant improvement in classification accuracy is gained by implementing the hybrid approach against the pure naïve Bayes model implemented with the round robin tournament ranking method and also against the pure naïve Bayes model implemented with the single elimination tournament ranking method. The hybrid approach also shows higher accuracy as compared to the naïve Bayes classifier with the round robin tournament structure ranking technique. A data set of Mathematics topics extracted from arxiv.org has also been used to test the performance of our proposed hybrid approach of naïve Bayes vectorization and SVM classification. This data set contains eight categories of mathematics subtopics: Algebraic Geometry, Analysis of PDEs, Combinatorics, Differential Geometry, Mathematical Physics, Number Theory, Probability, and Statistics. All the categories have a high degree of similarity with each other. Because the categories of this data set are subtopics of the mathematics subject, common mathematical terms are widely used by the documents from all the subtopics. There are only a few specific and unique terms to differentiate each topic, and the frequency of occurrence of specific terms in a category is the main key to differentiating these topics. We have collected 40 documents for each category, with a total of 320 documents in the entire data set. Ten documents from each category were extracted randomly to build the training data set, while the remaining 30 documents from each category were extracted for the purpose of testing the performance of the classifier. Fig. 7 illustrates the comparison of performance between the pure
naïve Bayes classifier and our proposed hybrid classification algorithm for the Mathematics data set. The experimental results illustrated in the comparison table shown in Fig. 7 go against our prediction that the implementation of the hybrid classification approach should show an improvement in classification performance as compared to the ordinary naïve Bayes classifier. The hybrid approach has slightly reduced classification performance when implemented using the naïve Bayes vectorizer with the flat ranking technique and also with the flat ranking naïve Bayes vectorizer with HRKE. However, the two tournament structure ranking techniques have reasonably good performance compared to the flat ranking technique, which has an 81.25 percent classification accuracy when implemented with the ordinary pure naïve Bayes classifier. The main reason why the hybrid approach shows worse results with this database as compared to the naïve Bayes classifier is that the classes (categories) for this database share many common keywords. As such, differentiating between categories using only these keywords (without domain-specific knowledge) can be highly inaccurate. However, the naïve Bayes performs better here because the classification is based on the highest probability category only and does not consider the probability values associated with other categories as the SVM does. Hence, the ambiguity associated with similar probability distributions is avoided. To provide a basis for comparing our hybrid technique with those found in the literature [14], [18], [19], [29], [30], we make the following observations. We compare our proposed naïve Bayes classifier and the NB-SVM hybrid classifier with the Lsquare method and the SVM presented in [29] (Fig. 8). The experiment in [29] was repeated using six arbitrarily chosen categories from the 20 Newsgroups data set to compare the performance of the competing classification approaches across the categories. Besides this, we also compare the training and testing time consumption of the naïve Bayes classifier with the time consumption of the naïve Bayes–SVM hybrid (our work here) and the published time consumption of the Lsquare method [29]. We have also implemented our own version of TF-IDF using the commonly used formula and have obtained a training time using this vectorizer for the SVM. The TF-IDF/SVM hybrid performs badly due to the fact that the SVM performs a multidimensional comparison simultaneously. In other words, if a differentiating keyword
Fig. 9. Comparison to other published techniques. * No comparable results available. However, Lsquare takes 2 min., 27 secs. [29] to test 19,800 documents. Our test time is for 5,800 documents. All classifiers trained with 200 documents from the 20 Newsgroups data set.
Fig. 10. Cross validation results for the NB-SVM hybrid using all the 20 categories of the 20 Newsgroups data set.
occurs in a different location in the test document as compared to documents in the training data set, the SVM will not classify that document in the same category although it has the same keyword. This leads to low classification accuracy when vectorization is done with TF-IDF. In the current literature [30], TF-IDF was used independently of the SVM, and no comparable results are available. The SVM training time using TF-IDF is around 15 seconds (plus 2 seconds to calculate TF-IDF, giving a total of 17 seconds in Fig. 9) and is due to the very high dimensions associated with this technique. The system specifications for running all the algorithms are a Pentium 4 2.0 GHz, 512 Mbytes of RAM, and Windows 2000.
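The dimensionality gap that drives these timing differences can be illustrated with a short sketch (assuming scikit-learn and its 20 Newsgroups loader; the example categories and the resulting vocabulary size are ours and depend on preprocessing):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB

cats = ["rec.autos", "sci.space", "talk.politics.misc"]   # example categories
train = fetch_20newsgroups(subset="train", categories=cats)

tfidf = TfidfVectorizer().fit_transform(train.data)
print("TF-IDF dimensions:", tfidf.shape[1])               # typically tens of thousands

counts = CountVectorizer().fit_transform(train.data)
probs = MultinomialNB().fit(counts, train.target).predict_proba(counts)
print("Probability-vector dimensions:", probs.shape[1])   # equals the number of categories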
4 CONCLUSION AND FUTURE WORK
A hybrid text document classification approach has been described here. This is achieved through the implementation of the naïve Bayes method at the front end for raw text data vectorization, in conjunction with an SVM classifier at the back end to classify the documents to the right category. The results from our experiments show that the proposed hybrid approach of the naïve Bayes vectorizer and SVM classifier has improved classification accuracy compared to the pure naïve Bayes classification approach. However, this was not seen with the Mathematics data set (Fig. 7), which has a high degree of similarity between keywords. In this case, using only keywords to differentiate between the different subtopics was highly inaccurate. The SVM is affected because it considers the probabilities of all categories in its classification task, whereas the naïve Bayes classifier uses only the highest probability category to identify the correct class of a document. For the case where there are many common keywords between categories, the latter technique seems to show better performance. In comparison with other techniques (Figs. 8 and 9), we have shown better performance in terms of training time, testing time, and classification accuracy. The addition of the SVM to the naïve Bayes approach adds only one second to the train and test times. In the future, our research will emphasize enhancing the ability and performance of our existing prototype for data sets with many common keywords between categories through the introduction of facilities such as limited natural language processing to handle word-stem classification and SOM clustering to aid the SVM classification. Besides this,
our research work is also geared toward implementation in other engineering fields such as sensor monitoring for the oil and gas industry [17].
REFERENCES

[1] B. Kamens, Bayesian Filtering: Beyond Binary Classification. Fog Creek Software, 2005.
[2] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," Proc. AAAI Workshop Learning for Text Categorization, 1998.
[3] S.J. Delany, P. Cunningham, and L. Coyle, "An Assessment of Case-Based Reasoning for Spam Filtering," Artificial Intelligence J., vol. 24, nos. 3-4, pp. 359-378, 2005.
[4] P. Cunningham, N. Nowlan, S.J. Delany, and M. Haahr, "A Case-Based Approach in Spam Filtering that Can Track Concept Drift," Proc. ICCBR Workshop Long-Lived CBR Systems, 2003.
[5] A. McCallum and K. Nigam, "A Comparison of Event Models for Naïve Bayes Text Classification," J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.
[6] S.J. Delany, P. Cunningham, A. Tsymbal, and L. Coyle, "A Case-Based Technique for Tracking Concept Drift in Spam Filtering," J. Knowledge Based Systems, vol. 18, nos. 4-5, pp. 187-195, 2004.
[7] S. Block, D. Medin, and D. Osherson, "Probability from Similarity," technical report, Northwestern Univ., Rice Univ., 2002.
[8] P.A. Flach, E. Gyftodimos, and N. Lachiche, "Probabilistic Reasoning with Terms," technical report, Univ. of Bristol, Louis Pasteur Univ., 2002.
[9] K. Nigam, J. Lafferty, and A. McCallum, "Using Maximum Entropy for Text Classification," Proc. IJCAI Workshop Machine Learning for Information Filtering, pp. 61-67, 1999.
[10] Y. Xia, W. Liu, and L. Guthrie, "Email Categorization with Tournament Methods," Proc. Int'l Conf. Application of Natural Language (NLDB), 2005.
[11] X. Su, "A Text Categorization Perspective for Ontology Mapping," technical report, Dept. of Computer and Information Science, Norwegian Univ. of Science and Technology, 2002.
[12] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[13] E.H. Han, G. Karypis, and V. Kumar, "Text Categorization Using Weight Adjusted k-Nearest Neighbour Classification," Dept. of Computer Science and Eng., Army HPC Research Center, Univ. of Minnesota, 1999.
[14] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. 10th European Conf. Machine Learning (ECML '98), pp. 137-142, 1998.
[15] M. Hartley, D. Isa, V.P. Kallimani, and L.H. Lee, "A Domain Knowledge Preserving in Process Engineering Using Self-Organizing Concept," technical report, Intelligent System Group, Faculty of Eng. and Computer Science, Univ. of Nottingham, Malaysia Campus, 2006.
[16] S. Chakrabarti, S. Roy, and M.V. Soundalgekar, "Fast and Accurate Text Classification via Multiple Linear Discriminant Projection," VLDB J., pp. 170-185, 2003.
[17] D. Isa, R. Rajkumar, and K.C. Woo, "Defect Detection in Oil and Gas Pipelines Using the Support Vector Machine," Proc. WSEAS Conf. Circuits, Systems, Electronics and Comm., Dec. 2007.
[18] S.B. Kim, H.C. Rim, D.S. Yook, and H.S. Lim, "Effective Methods for Improving Naïve Bayes Text Classifiers," Proc. Seventh Pacific Rim Int'l Conf. Artificial Intelligence (PRICAI '02), vol. 2417, 2002.
[19] H. Brucher, G. Knolmayer, and M.A. Mittermayer, "Document Classification Methods for Organizing Explicit Knowledge," technical report, Research Group Information Eng., Inst. Information Systems, Univ. of Bern, 2002.
[20] D. Isa, L.H. Lee, and V.P. Kallimani, "A Polychotomizer for Case-Based Reasoning beyond the Traditional Bayesian Classification Approach," J. Computer and Information Science, Canadian Center of Science and Education, vol. 1, no. 1, pp. 57-68, Feb. 2008.
[21] V.N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.
[22] C. O'Brien and C. Vogel, "Spam-Filters: Bayes versus Chi-squared; Letter versus Words," Proc. First Int'l Symp. Information and Comm. Technologies (ISCIT), 2002.
[23] S. Haykin, Neural Networks: A Comprehensive Foundation, second ed. Prentice Hall, 1999.
[24] B. Gutschoven and P. Verlinde, "Multi-Modal Identity Verification Using Support Vector Machines (SVM)," technical report, Signal and Image Centre, Royal Military Academy, citeseer.ist.psu.edu/gutschoven00multimodal.html.
[25] S. Gunn, "Support Vector Machines for Classification and Regression," technical report, Information: Signals, Images, Systems (ISIS) Research Group, Univ. of Southampton, http://www.ecs.soton.ac.uk/~srg/publications/pdf/SVM.pdf, 1998.
[26] M. Law, "A Simple Introduction to Support Vector Machines," Dept. of Computer Science and Eng., Michigan State Univ., Lecture for CSE 802, www.cse.msu.edu/~lawhiu/intro_SVM_new.ppt.
[27] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, Bell Laboratories, Lucent Technologies, research.microsoft.com/~cburges/papers/SVMTutorial.pdf, 1998.
[28] V. Kecman, "Support Vector Machines Basics," Report 616, School of Eng., Univ. of Auckland, http://genome.tugraz.at/MedicalInformatics2/Intro_to_SVM_Report_616_V_Kecman.pdf, 2004.
[29] H. Al-Mubaid and S.A. Umair, "A New Text Categorization Technique Using Distributed Clustering and Learning Logic," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 9, Sept. 2006.
[30] P. Soucy and G.W. Mineau, "Beyond TF-IDF Weighting for Text Categorization in the Vector Space Model," Proc. 19th Int'l Joint Conf. Artificial Intelligence (IJCAI '05), pp. 1130-1135, 2005.
[31] D. Isa, V.P. Kallimani, and L.H. Lee, "Using the Self Organizing Map for Clustering Text Documents," Expert Systems with Applications, submitted for publication, 2007.
Dino Isa received the BSEE (Hons) degree in electrical engineering from the University of Tennessee, Knoxville, in 1986 and the PhD degree from the University of Nottingham, University Park, Nottingham, England, in 1991. He was with Motorola Seremban as the engineering section head and then moved to Crystal Clear Technology in 1996 as the chief technology officer prior to joining the University of Nottingham in 2001. He is an associate professor at the Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus. His current interest is in applying the support vector machine in various domains in order to further understand its mechanisms. Lam Hong Lee received the bachelor’s degree in computer science from University Putra, Malaysia, in 2004 and is currently in his third year of his PhD program at the Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus. His current research interest lies in improving text categorization using AI techniques.
V.P. Kallimani received the BEng degree in electronics engineering from Bangalore University, India, in 1984 and the master’s degree in electronics engineering from the Birla Institute of Technology in 1990. He was first employed on the Karnataka Electricity Board in 1984 and moved to the telecommunications sector of the same company in 1986. He joined the University’s School of Computer Science in 2002. He is currently an assistant professor in the Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus. His current research interests include knowledge engineering, specifically in data mining, case-based reasoning, and AI techniques relevant to CBR. R. RajKumar received the master’s degree from the School of Electrical and Electronic Engineering, Faculty of Engineering and Computer Science, University of Nottingham, Malaysia Campus, in 2005. He is currently a PhD student in the same university. His main interest is in using the Support Vector Machine to predict failures in oil and gas pipelines.