Combining Active Learning and Boosting for Naïve Bayes Text Classifiers*

Han-joon Kim¹ and Je-uk Kim²

¹ Department of Electrical and Computer Engineering, University of Seoul, Korea
[email protected]
² Daewoo Information Systems Company, Korea
[email protected]

* This research was supported by the University of Seoul, Korea, in the year of 2003.

Abstract. This paper presents a variant of the AdaBoost algorithm for boosting Naïve Bayes text classifiers, called AdaBUS, which combines active learning with the boosting algorithm. Boosting has been shown to effectively improve the accuracy of machine-learning based classifiers. However, the Naïve Bayes classifier, which is remarkably successful in practice for text classification problems, is known not to work well with the boosting technique because its base classifiers are too stable. The proposed algorithm focuses on boosting Naïve Bayes classifiers by performing active learning at each iteration of the boosting process. The basic idea is to induce perturbation of the base classifiers by augmenting the training set with the most informative unlabeled documents.

Keywords: Text classification, Naïve Bayes, boosting, active learning, selective sampling

1 Introduction

Boosting is a general technique for improving the accuracy of machine-learning based classifiers, which combines a series of base (or weak) classifiers to produce a single powerful classifier. Since the boosting technique was developed by Yoav Freund and Rob Schapire [4], it has been considered one of the best approaches to improving classifiers in many previous studies [13]. In particular, boosting significantly improves the decision tree learning algorithm [5, 6, 11]. In our work, we focus on boosting the Naïve Bayes classifier, which is a simple yet surprisingly accurate technique and has been used in many classification projects [1]. For text classification problems in particular, the Naïve Bayes classifier is known to be remarkably successful in practice, despite the fact that text data generally has a huge number of attributes (features) [10]. For this reason, it is worthwhile to apply boosting to Naïve Bayes text classifiers. However, the Naïve Bayes classifier is known to perform poorly with boosting [14]. This is because effective boosting, in principle, requires high variance (or instability) in the accuracy of the base classifiers [14], but Naïve Bayes is relatively stable with respect to changes in the training set.

This paper presents a variant of the AdaBoost algorithm for boosting Naïve Bayes text classifiers, called AdaBUS, which combines active learning with the AdaBoost boosting algorithm. In the active learning approach, the learner automatically selects the most informative examples for class labeling and training, without depending on a teacher's decision or random sampling. To this end, we use the uncertainty-based selective sampling method [7, 8], which has been frequently used for learning with text data. For selective sampling, we propose an uncertainty measure that fits the Naïve Bayes learning framework.

2 Preliminaries

2.1 Naïve Bayes Learning Framework for Text Classification

Learning a Naïve Bayes (NB) text classifier means estimating the parameters of a generative model from a set of labeled training documents. The estimated classification model is composed of two kinds of parameters: the word probability estimates $\hat{\theta}_{w|c}$ and the class prior probabilities $\hat{\theta}_c$; that is, the classification model is $\hat{\theta}_{NB} = \{\hat{\theta}_{w|c}, \hat{\theta}_c\}$. Each parameter can be estimated according to maximum a posteriori (MAP) estimation¹. To classify a given document, the Naïve Bayes learning method estimates the posterior probability of a class via Bayes' rule; that is,

  $Pr(c_j|d_i) = \frac{Pr(c_j) \cdot Pr(d_i|c_j)}{Pr(d_i)}$,

where $Pr(c_j)$ is the class prior probability that any random document from the document collection belongs to the class $c_j$, $Pr(d_i|c_j)$ is the probability that a randomly chosen document from the documents in the class $c_j$ is the document $d_i$, and $Pr(d_i)$ is the probability that a randomly chosen document from the whole collection is the document $d_i$. The document $d_i$ is then assigned to the class $argmax_{c_j \in C} Pr(c_j|d_i)$ with the largest posterior². Here, the document $d_i$ is represented by a bag of words $(w_{i1}, w_{i2}, \cdots, w_{i|d_i|})$ in which multiple occurrences of words are preserved. Moreover, the Naïve Bayes classifier is based on the simplifying assumption that the terms in a document are mutually independent and that the probability of a term occurrence is independent of its position within the document. This assumption results in the following classification function:

  $f_{\hat{\theta}_{NB}}(d_i) = argmax_{c_j \in C} Pr(c_j|d_i) = argmax_{c_j \in C} Pr(c_j) \cdot \prod_{k=1}^{|d_i|} Pr(w_{ik}|c_j)$.

To generate this classification function, $Pr(c_j)$ can be simply estimated by counting the frequency with which each class value $c_j$ occurs in the set of training documents $D^t$, where $Pr(c_j|d_i) \in \{0, 1\}$ is given by the class label. That is,

  $Pr(c_j) = \hat{\theta}_{c_j} = \frac{\sum_{i=1}^{|D^t|} Pr(c_j|d_i)}{|D^t|}$.

As for $Pr(w_{ik}|c_j)$, its estimate using Laplace's law of succession [9] is

  $Pr(w_{ik}|c_j) = \hat{\theta}_{w_{ik}|c_j} = \frac{tf_{c_j}(w_{ik}) + 1}{\sum_{w \in V} tf_{c_j}(w) + |V|}$,

where $tf_{c_j}(w)$ is the number of occurrences of the word $w$ in the class $c_j$ and $V$ denotes the set of significant words extracted from the training documents.

¹ MAP estimation finds the maximally probable model $\theta_{MAP}$ among the possible models $\Theta$, given a set of training documents $D$; that is, $\theta_{MAP} \equiv argmax_{\theta \in \Theta} Pr(\theta|D) = argmax_{\theta \in \Theta} \frac{Pr(D|\theta) \cdot Pr(\theta)}{Pr(D)} = argmax_{\theta \in \Theta} Pr(D|\theta) \cdot Pr(\theta)$.
² $argmax_{x \in X} f(x)$ is the value of $x$ that maximizes $f(x)$.
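To make the estimation concrete, the following is a minimal sketch (not the authors' implementation) of Naïve Bayes training and classification with Laplace smoothing in Python. The function names train_nb and classify_nb, the per-document weights (which the boosting loop of Section 2.2 will need), and the log-space scoring are our own assumptions rather than details given in the paper.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, vocab, weights=None):
    """Estimate theta_c and theta_{w|c} from (possibly weighted) labeled documents.

    docs   : list of bags of words (lists of tokens)
    labels : list of class labels, one per document
    vocab  : set of significant words V
    weights: optional per-document weights (defaults to uniform)
    """
    if weights is None:
        weights = [1.0 / len(docs)] * len(docs)
    classes = sorted(set(labels))
    prior = defaultdict(float)                 # class priors theta_c
    tf = {c: Counter() for c in classes}       # weighted term frequencies per class
    for doc, c, w in zip(docs, labels, weights):
        prior[c] += w
        for word in doc:
            if word in vocab:
                tf[c][word] += w
    total_w = sum(prior.values())
    prior = {c: prior[c] / total_w for c in classes}
    # Laplace's law of succession: (tf_c(w) + 1) / (sum_w tf_c(w) + |V|)
    cond = {}
    for c in classes:
        denom = sum(tf[c].values()) + len(vocab)
        cond[c] = {w: (tf[c][w] + 1) / denom for w in vocab}
    return prior, cond

def classify_nb(doc, prior, cond, vocab):
    """Return argmax_c Pr(c) * prod_k Pr(w_k | c), computed in log space."""
    best_c, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for word in doc:
            if word in vocab:
                score += math.log(cond[c][word])
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```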


Table 1. The AdaBoost Algorithm with Naïve Bayes.

Input:  A set of training documents $D^t = \{\langle d_i, c_j \rangle \mid d_i \in D, c_j \in C\}$
Output: A classifier $f_{\hat{\theta}_{NB}}$, where $\hat{\theta}_{NB} = \{\hat{\theta}_{w|c}, \hat{\theta}_c\}$
Notation:
  $l_{NB}(D^t) = argmax_{\hat{\theta}_{NB}} Pr(D^t|\hat{\theta}_{NB}) \cdot Pr(\hat{\theta}_{NB})$  /* MAP estimation */
  $f_{\theta_{NB}}(d_i) = argmax_{c_j \in C} Pr(c_j|d_i) = argmax_{c_j \in C} Pr(c_j) \cdot \prod_{k=1}^{|d_i|} Pr(w_{ik}|c_j)$
  $w^{(t)} = (w^{(t)}_1, \cdots, w^{(t)}_{|D^t|})$  /* weight distribution over the training documents */

1. Initialize: $w^{(1)}_{d_i} = \frac{1}{|D^t|}$ for all $d_i \in D^t$
2. Learn: repeat for t = 1 to T
   (a) Estimate a classification model with respect to the weighted training documents: $\hat{\theta}_{NB} = l_{NB}(D^t)$
   (b) Build a base classifier $f_{\hat{\theta}_{NB}}$ with the estimated model $\hat{\theta}_{NB}$
   (c) Calculate the weighted training error $\epsilon_t$ of $\hat{\theta}_{NB}$:
       $\epsilon_t = \sum_{d_i \in D^t} w^{(t)}_{d_i} \cdot I\left(f^{(t)}_{\hat{\theta}_{NB}}(d_i) \neq f_\theta(d_i)\right)$  /* I(x) = 1 if x is true, 0 otherwise */
   (d) Calculate the confidence $\alpha_t$ of $\hat{\theta}_{NB}$:
       $\alpha_t = \frac{1}{2} \ln\frac{1-\epsilon_t}{\epsilon_t}$
   (e) Update the weights for the next iteration:
       $w^{(t+1)}_{d_i} = z_t \times w^{(t)}_{d_i} \times \begin{cases} e^{-\alpha_t} & \text{if } f^{(t)}_{\hat{\theta}_{NB}}(d_i) = f_\theta(d_i) \\ e^{+\alpha_t} & \text{if } f^{(t)}_{\hat{\theta}_{NB}}(d_i) \neq f_\theta(d_i) \end{cases}$
       where $z_t$ is a normalization factor so that $w^{(t+1)}$ is a probability distribution, i.e., $\sum_{d_i \in D^t} w^{(t+1)}_{d_i} = 1$
3. Return the final classifier, a weighted majority vote of the generated classifiers $\{f^{(t)}_{\hat{\theta}_{NB}}\}_{t=1}^{T}$:
   $f_{\hat{\theta}_{NB}}(d) = argmax_{c_i \in C} \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{r=1}^{T} \alpha_r} \cdot I\left(f^{(t)}_{\hat{\theta}_{NB}}(d) = c_i\right)$

2.2 The AdaBoost Boosting Algorithm with Naïve Bayes

This section presents the procedure of the AdaBoost algorithm with Naïve Bayes learning, described using the notation and the Naïve Bayes learning framework introduced in the previous section. As shown in Table 1, the boosting algorithm is composed of two phases: the learning phase and the voting phase. The learning phase induces a series of base classifiers while repeatedly updating the distribution of weights over the training examples, based on the previously generated base classifiers. Initially, the boosting process sets the weight of each training document to $\frac{1}{|D^t|}$ (see line 1). At each iteration, the process estimates a classification model $\hat{\theta}_{NB}$ with respect to the weighted training documents, and then generates a base classifier using the estimated model $\hat{\theta}_{NB}$, as shown in lines 2(a)-(b). After a base classifier is learned, the weights of the training examples are individually modified to allow the subsequent classifier to focus on misclassified training examples. For this, the weighted training error $\epsilon_t$ of the estimated model $\hat{\theta}_{NB}$ is calculated as shown in line 2(c)³, and then, from the error $\epsilon_t$, the confidence $\alpha_t$ that measures the importance of $\hat{\theta}_{NB}$ is computed. This measure gets larger as $\epsilon_t$ gets smaller, as shown in line 2(d). After that, the confidence measure $\alpha_t$ is used for re-weighting the training documents as shown in line 2(e); that is, the process increases the weights of documents incorrectly classified by the currently estimated model $\hat{\theta}_{NB}$ and decreases the weights of correctly classified documents. After T rounds, in the voting phase, the process combines the T base classifiers into an improved classifier through voting. As shown in line 3, the boosted final classifier is a weighted majority vote of the base classifiers, in which the confidence $\alpha_t$ is re-used as the voting weight.

³ The notation $f_\theta(d_i) = c_j$ means that the human labeler (who is assumed to know the true classification model $\theta$) determines the true class label of the document $d_i$ to be $c_j$.
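As an illustration of the learning and voting phases described above, here is a minimal sketch of the AdaBoost loop of Table 1. It builds on the hypothetical train_nb/classify_nb functions sketched in Section 2.1 (any weight-aware base learner would do); the clipping of $\epsilon_t$ away from 0 and 1 is our own safeguard, not part of the algorithm in the paper. Since the true labels $f_\theta(d_i)$ of the training documents are simply their given class labels, the indicator comparisons reduce to checking predictions against labels, as done here.

```python
import math
from collections import defaultdict

def adaboost_nb(docs, labels, vocab, T):
    """AdaBoost with a Naive Bayes base learner (sketch of Table 1)."""
    n = len(docs)
    w = [1.0 / n] * n                                  # line 1: uniform initial weights
    models, alphas = [], []
    for t in range(T):
        prior, cond = train_nb(docs, labels, vocab, weights=w)   # lines 2(a)-(b)
        preds = [classify_nb(d, prior, cond, vocab) for d in docs]
        # line 2(c): weighted training error
        eps = sum(wi for wi, p, y in zip(w, preds, labels) if p != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)          # guard against division by zero / log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)        # line 2(d): confidence
        models.append((prior, cond))
        alphas.append(alpha)
        # line 2(e): re-weight misclassified documents up, correct ones down, then normalize
        w = [wi * math.exp(-alpha if p == y else alpha)
             for wi, p, y in zip(w, preds, labels)]
        z = sum(w)
        w = [wi / z for wi in w]

    def final_classifier(doc):
        # line 3: weighted majority vote over the T base classifiers
        votes = defaultdict(float)
        total = sum(alphas)
        for (prior, cond), alpha in zip(models, alphas):
            votes[classify_nb(doc, prior, cond, vocab)] += alpha / total
        return max(votes, key=votes.get)

    return final_classifier
```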

3 Boosting Naïve Bayes

3.1 Basic Idea

As mentioned before, to achieve effective boosting of Naïve Bayes, we must find a way to increase the instability of the base classifiers. Our strategy for boosting Naïve Bayes is an incremental augmentation of the set of training documents through an active learning approach. In active learning, the learner actively chooses informative training documents from a pool of unlabeled documents, with the aim of reducing the number of training examples while maintaining high classification accuracy [2]. At each iteration of the boosting process, well-chosen informative documents are added to the current training set according to the principle of active learning. By learning with an augmented training set that differs from the previous one, the newly generated classification model is expected not to be similar to the previous model within the set of all possible models. This perturbation of the training set introduces instability into the Naïve Bayes classifiers. Besides, since the training set is augmented with highly informative training examples, a base classifier built by learning from such examples is expected to improve gradually at each iteration. The proposed strategy can thus prevent the increase in classification error of the base classifiers that might otherwise be incurred by the intended instability. Consequently, base classifiers with higher instability and higher accuracy are combined into a more powerful boosted classifier.


Table 2. Modification to the AdaBoost Algorithm with Naïve Bayes: AdaBUS.

Input:  A set of training documents $D^t = \{\langle d_i, c_j \rangle \mid d_i \in D, c_j \in C\}$
        A set of unlabeled documents $D^u = \{d_k \mid d_k \in D\}$
Output: A classifier $f_{\hat{\theta}_{NB}}$, where $\hat{\theta}_{NB} = \{\hat{\theta}_{w|c}, \hat{\theta}_c\}$

1. Initialize: $w^{(1)}_{d_i} = \frac{1}{|D^t|}$ for all $d_i \in D^t$
2. Learn: repeat for t = 1 to T
   (a) Estimate a classification model with respect to the weighted training documents: $\hat{\theta}_{NB} = l_{NB}(D^t)$
   (b) Build a base classifier $f_{\hat{\theta}_{NB}}$ with the estimated model $\hat{\theta}_{NB}$
   (c) Calculate the weighted training error $\epsilon_t$ of $\hat{\theta}_{NB}$:
       $\epsilon_t = \sum_{d_i \in D^t} w^{(t)}_{d_i} \cdot I\left(f^{(t)}_{\hat{\theta}_{NB}}(d_i) \neq f_\theta(d_i)\right)$
   (d) Calculate the confidence $\alpha_t$ of $\hat{\theta}_{NB}$:
       $\alpha_t = \frac{1}{2} \ln\frac{1-\epsilon_t}{\epsilon_t}$
   (e) Find the document with the largest uncertainty among the unlabeled documents:
       $d_{max} = argmax_{d \in D^u} CU_{NB}(d)$
   (f) Break if $CU(d_{max}) < \mu$  /* $\mu$ is a threshold value for selective sampling */
   (g) Augment the current training set by labeling the selected document:
       $D^t = D^t \cup \{\langle d_{max}, f_\theta(d_{max}) \rangle\}$
   (h) Determine the initial weight of the selected document:
       $w^{(t)}_{d_{max}} = \frac{\sum_{d_i \in D^t} w^{(t)}_{d_i}}{|D^t|}$
   (i) Update the weights:
       $w^{(t+1)}_{d_i} = z_t \times w^{(t)}_{d_i} \times \begin{cases} e^{-\alpha_t} & \text{if } f^{(t)}_{\hat{\theta}_{NB}}(d_i) = f_\theta(d_i) \\ e^{+\alpha_t} & \text{if } f^{(t)}_{\hat{\theta}_{NB}}(d_i) \neq f_\theta(d_i) \end{cases}$
3. Return the final classifier, a weighted majority vote of the generated classifiers $\{f^{(t)}_{\hat{\theta}_{NB}}\}_{t=1}^{T}$:
   $f_{\hat{\theta}_{NB}}(d) = argmax_{c_i \in C} \sum_{t=1}^{T} \frac{\alpha_t}{\sum_{r=1}^{T} \alpha_r} \cdot I\left(f^{(t)}_{\hat{\theta}_{NB}}(d) = c_i\right)$

The challenging problem is how to isolate the best candidate training examples from the set of unlabeled documents. In our work, we adopt uncertainty-based selective sampling [7] because it has been frequently used for learning with text data. The sampling process is based on classification uncertainty, which is the degree of uncertainty in the classification of a test example with respect to the currently derived model. In the following subsection, an uncertainty measure that fits the Naïve Bayes learning method is devised.

3.2 Adaptive Boosting with Uncertainty-Based Selective Sampling: AdaBUS

Uncertainty-Based Selective Sampling. As we have already seen, the Naïve Bayes learning method develops a probability distribution over words $W$ for each given class $c$ (i.e., $Pr(W|c)$) that accounts for the concept of that class. In this regard, if a document's classification is uncertain under the current model, we can say that the word distribution for its correct class is still not well developed for classification. In such a case, the probability distribution over the words occurring in the input document is similar (or near) for its correct class and for the other classes. From this, we find that classification uncertainty can be determined by measuring the distances between the learned word distributions. We therefore propose an uncertainty measure based on the Kullback-Leibler (KL) divergence, which is a standard information-theoretic measure of the distance between two probability mass functions [3]. For a document $d$, the KL divergence between the word distributions induced by the two classes $c_i$ and $c_j$ is defined as

  $KLdist_d(Pr(W|c_i), Pr(W|c_j)) = \sum_{w_k \in d} Pr(w_k|c_i) \cdot \log\frac{Pr(w_k|c_i)}{Pr(w_k|c_j)}$.

The classification uncertainty $CU(d)$ of the document $d$ is then defined as follows:

  $CU(d) = 1 - \frac{\sum_{c_i, c_j \in C} KLdist_d(Pr(W|c_i), Pr(W|c_j))}{|C| \cdot (|C|-1)}$    (1)

where $|C|$ denotes the total number of the existing classes. Note that the value of $KLdist_d(\cdot)$ is measured not over all words but only over those words belonging to the categorized document $d$.

Incorporating Active Learning into AdaBoost. With the proposed uncertainty measure, the tasks for incrementally updating the training set are performed within the conventional boosting process. Table 2 shows the modification of the AdaBoost algorithm that includes the additional selective sampling tasks. Lines 2(e)-(h), which correspond to the selective sampling tasks, are added to the AdaBoost algorithm of Table 1. As shown in line 2(e), the process finds the document with the largest uncertainty among the given unlabeled documents $D^u$. The selected document $d_{max}$ is then checked to see whether its classification uncertainty is larger than a given threshold value $\mu$ (see line 2(f)). If so, its appropriate class label is assigned to the document by the human labeler, and the initial weight of $d_{max}$ is set to the average of the current weights of the training examples (see lines 2(g)-(h)). The initially determined weight is subsequently modified according to the re-weighting rule. In our algorithm, the time at which to stop the process is determined by tuning the uncertainty threshold. By doing so, we can adaptively determine the termination time according to the status of the current learning process. The rationale for this termination policy is that a low uncertainty of the selected documents suggests that further iterations will not generate base classifiers that differ meaningfully from the previous ones. If the training set has been sufficiently augmented, the base classifier will probably become stable without further improvement.
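The following is a minimal sketch, under our own assumptions, of how the classification uncertainty of Eq. (1) and the document selection of lines 2(e)-(f) in Table 2 could be computed. The word distributions $Pr(w|c)$ are taken from the cond dictionary of the earlier Naïve Bayes sketch, and the fallback constant eps and the function names are hypothetical, not prescribed by the paper.

```python
import math

def kl_dist_doc(doc, cond_ci, cond_cj, vocab, eps=1e-12):
    """KLdist_d(Pr(W|c_i), Pr(W|c_j)), measured only over the words of the document d."""
    total = 0.0
    for w in set(doc):
        if w in vocab:
            p, q = cond_ci.get(w, eps), cond_cj.get(w, eps)
            total += p * math.log(p / q)
    return total

def classification_uncertainty(doc, cond, vocab):
    """CU(d) = 1 - average pairwise KL distance between the class word distributions (Eq. 1)."""
    classes = list(cond)
    acc = 0.0
    for ci in classes:
        for cj in classes:
            if ci != cj:
                acc += kl_dist_doc(doc, cond[ci], cond[cj], vocab)
    return 1.0 - acc / (len(classes) * (len(classes) - 1))

def select_most_uncertain(unlabeled_docs, cond, vocab, mu):
    """Lines 2(e)-(f): return the most uncertain unlabeled document, or None if it falls below mu."""
    best_doc, best_cu = None, float("-inf")
    for d in unlabeled_docs:
        cu = classification_uncertainty(d, cond, vocab)
        if cu > best_cu:
            best_doc, best_cu = d, cu
    return best_doc if best_cu >= mu else None
```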


Table 3. Setup of test document sets.

Category   Total documents   Initial training set   Test document set   Unlabeled training set
acq              264                  5                    30                   229
crude            165                  5                    30                   130
earn             504                  5                    30                   469
interest         249                  5                    30                   214
money-fx         179                  5                    30                   144
ship             222                  5                    30                   187
trade            150                  5                    30                   115

4 Experimental Results

4.1 Experimental Setup

In order to evaluate our method, we used the Reuters-21578 SGML document collection [12], which has been the most commonly used collection in the text classification literature. This data set consists of 21,578 articles, each one pre-labeled with one or more of 135 topics (categories). However, the Reuters collection has a very skewed distribution of documents over the 135 categories, and thus it has often been criticized as a poor collection for classification. For a more reliable evaluation, we generated a subset of the Reuters collection in which the documents are not skewed over categories. First, we selected the documents belonging to the 7 most frequent categories: 'acq', 'crude', 'earn', 'interest', 'money-fx', 'ship', and 'trade'. Among those documents, we then chose the 1,733 documents that had a single category, in order to avoid the ambiguity of documents with multiple topics. These controlled document sets are described in detail in Table 3. In our experiment, the results are discussed with respect to the F1-measure, which gives equal weight to recall and precision. This measure varies from 0 to 1 and is proportionally related to classification effectiveness. With the F1-measure for each category, we compute the overall F1-measure for the collection by macro-averaging.
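For reference, the standard definitions we assume here (the paper does not spell them out): with per-category precision $P_c$ and recall $R_c$,

$$F1_c = \frac{2 P_c R_c}{P_c + R_c}, \qquad \text{macro-}F1 = \frac{1}{|C|} \sum_{c \in C} F1_c .$$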

4.2 Performance Analysis

Effect of Selective Sampling in the Boosting Process. Figure 1 illustrates the changes in the F1-measure according to the number of training documents added to the initial training set. Note that the conventional AdaBoost algorithm without active learning does not augment the initial training set, unlike the proposed method (denoted AdaBUS) and the uncertainty-based sampling method (denoted US). Thus, in the case of AdaBoost and NB, the total number of training documents is set equal to the sum of the number of initial training documents and the number of appended training documents (represented on the horizontal axis of Figure 1). Therefore, all the methods compared in our experiment require the same amount of human labeling effort.

Fig. 1. Changes in F1-measure from varying the number of appended training documents: AdaBUS denotes the proposed method, NB Naïve Bayes without boosting, AdaBoost the conventional AdaBoost algorithm, and US the uncertainty-based selective sampling method. (The figure plots the F1-measure, roughly between 0.6 and 0.9, against 1 to 30 added training documents.)

As shown in this figure, we observe that the proposed AdaBUS method is successful in boosting Naïve Bayes; AdaBUS increases the quality of the Naïve Bayes classifier with an average gain of 10% in the F1-measure over the pure Naïve Bayes algorithm (NB). As for the AdaBoost algorithm, [14] reported that its boosting process cannot increase the quality of the Naïve Bayes classifier. Indeed, in our experiment the AdaBoost algorithm is worse than pure Naïve Bayes in most cases, as shown in Figure 1. This is probably because each of the base classifiers can have low accuracy⁴ when learned from an insufficient number of training examples. In addition, our experiment includes a comparison with the uncertainty-based sampling method (denoted US), because we cannot exclude the possibility that the sampling method alone outperforms the AdaBUS method. Fortunately, we observe that the AdaBUS method does give a benefit over the sampling method. Note that uncertainty-based sampling by itself also improves the accuracy of the Naïve Bayes classifier, as shown in Figure 1, since it allows the training set to be composed of the most informative documents from a pool of unlabeled documents. In short, the proposed method makes it possible to increase the variance and simultaneously the F1-measure of the base classifiers in Naïve Bayes learning, which results in effective boosting of the Naïve Bayes classifier.

⁴ Basically, in the case of a binary classifier, effective boosting requires that the accuracy of each base classifier be larger than 0.5.

5 Conclusions and Future Work

We have presented a method for boosting Naïve Bayes text classifiers by using an active learning (i.e., uncertainty-based selective sampling) approach. The basic idea behind our approach is to increase the variance of the base classifiers by incrementally augmenting the set of training documents with selectively sampled documents. To this end, we propose a special uncertainty measure that fits Naïve Bayes learning. In the future, we plan to combine the EM algorithm with the AdaBUS algorithm, considering the fact that the EM process can further improve the base classifiers without additional human effort.

References

1. R. Agrawal, R.J. Bayardo, and R. Srikant, "Athena: Mining-based Interactive Management of Text Databases," Proceedings of the 7th International Conference on Extending Database Technology, pp. 365–379, 2000.
2. S. Argamon-Engelson and I. Dagan, "Committee-Based Sample Selection for Probabilistic Classifiers," Journal of Artificial Intelligence Research, Vol. 11, pp. 335–360, 1999.
3. T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley, 1991.
4. Y. Freund and R.E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Proceedings of the 2nd European Conference on Computational Learning Theory, 1995.
5. Y. Freund and R.E. Schapire, "Experiments with a New Boosting Algorithm," International Conference on Machine Learning, pp. 148–156, 1996.
6. J.H. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Annals of Statistics, Vol. 28, No. 2, pp. 337–374, 2000.
7. D.D. Lewis and J. Catlett, "Heterogeneous Uncertainty Sampling for Supervised Learning," Proceedings of the 11th International Conference on Machine Learning, pp. 148–156, 1994.
8. M. Lindenbaum, S. Markovitch, and D. Rusakov, "Selective sampling for nearest neighbor classifiers," American Association for Artificial Intelligence, 1999.
9. T.M. Mitchell, "Bayesian Learning," Machine Learning, McGraw-Hill, pp. 154–200, 1997.
10. K. Nigam, A. McCallum, S. Thrun, and T.M. Mitchell, "Learning to Classify Text from Labeled and Unlabeled Documents," Proceedings of the 15th National Conference on Artificial Intelligence and the 10th Conference on Innovative Applications of Artificial Intelligence, pp. 792–799, 1998.
11. J.R. Quinlan, "Bagging, boosting, and C4.5," Proceedings of the 13th National Conference on Artificial Intelligence, pp. 725–730, 1996.
12. D.D. Lewis, "Reuters-21578 text categorization test collection," http://kdd.ics.uci.edu/databases/reuters21578/, 1997.
13. R.E. Schapire and Y. Singer, "BoosTexter: A Boosting-based System for Text Categorization," Machine Learning, Vol. 39, No. 2, pp. 135–168, 2000.
14. K.M. Ting and Z. Zheng, "A study of AdaBoost with Naïve Bayesian Classifiers: Weakness and Improvement," Computational Intelligence, Vol. 19, No. 2, pp. 186–200, 2003.
