Bagged Voting Ensembles

S.B. Kotsiantis and P.E. Pintelas
Educational Software Development Laboratory, Department of Mathematics, University of Patras, Hellas

Abstract. Bayesian and decision tree classifiers are among the most popular classifiers in the data mining community, and numerous researchers have recently examined their performance in ensembles. Although many methods of ensemble creation have been proposed, there is as yet no clear picture of which method is best. In this work, we propose Bagged Voting, which trains on different subsets of the same training dataset while combining a Bayesian and a decision tree algorithm through a voting methodology. In our algorithm, voters express the degree of their preference using the probabilities of the classifiers' predictions as confidence scores. All confidence values are then added for each candidate, and the candidate with the highest sum wins the election. We compared the presented ensemble with other ensembles that use either the Bayesian or the decision tree classifier as base learner and obtained better accuracy in most cases.

1 Introduction

In contrast to statistics, where prior knowledge is expressed in the form of probability distributions over the observations and over the assumed dependencies, data mining techniques make their prior knowledge explicit by restricting the space of assumed dependencies without making any distributional assumptions. Recently, combining data mining algorithms has been proposed as a new direction for improving the performance of individual algorithms. An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new instances. [5] provides an accessible and informal account, from statistical, computational and representational viewpoints, of why ensembles can improve results. The main reason may be that the training data cannot provide sufficient information for selecting a single best classifier from the set of hypotheses, because the amount of training data available is too small compared to the size of the hypothesis space.

Several methods have been recommended for the design of ensembles of classifiers [5]. Mechanisms used to build an ensemble of classifiers include: i) using different subsets of the training data with a single data mining method, ii) using different training parameters with a single training method, and iii) using different data mining methods. Although many methods of ensemble design have been proposed, there is as yet no clear picture of which method is best. This is in part because only a limited number of comparisons have been attempted, and several of those have concentrated on comparing boosting to bagging [1, 4, 25]. Both boosting and bagging are generic techniques that can be employed with any base data mining technique. They operate by selectively re-sampling from the training data to generate derived training sets to which the base learner is applied.

This paper explores an alternative method for constructing good ensembles that does not rely on a single data mining algorithm. The idea is very simple: use different subsets of the same training dataset with the concurrent usage of a voting methodology that combines a Bayesian and a decision tree algorithm. Using a voting methodology, we expect to obtain better results, based on the belief that the majority of experts are more likely to be correct in their decision when they agree in their opinion. In fact, the comparison with other ensembles that use either the Bayesian or the decision tree algorithm on 30 standard benchmark datasets showed that the proposed ensemble had better accuracy on average. Section 2 presents the most well-known methods for building ensembles, while Section 3 discusses the proposed ensemble method. Experimental results and comparisons of the presented combining method on a number of datasets with other ensembles that also use either the decision tree or the Bayesian classifier as base learner are presented in Section 4. We conclude in Section 5 with a summary and further research topics.

C. Bussler and D. Fensel (Eds.): AIMSA 2004, LNAI 3192, pp. 168-177, 2004. © Springer-Verlag Berlin Heidelberg 2004

2 Ensembles of Classifiers

As we have already mentioned, combining classifiers has been proposed as a new direction for improving the performance of individual classifiers. The goal of classification result integration algorithms is to generate more certain, precise and accurate system results. This section provides a brief survey of methods for constructing ensembles.

Bagging [4] is a "bootstrap" ensemble method that creates the individuals for its ensemble by training each classifier on a random redistribution of the training set. Each classifier's training set is generated by randomly drawing, with replacement, N examples, where N is the size of the original training set; many of the original examples may be repeated in the resulting training set while others may be left out. After several classifiers have been constructed, the final prediction is made by taking a vote over the predictions of the individual classifiers. [4] made the important observation that instability (responsiveness to changes in the training data) is a prerequisite for bagging to be effective: a committee of classifiers that all agree in all circumstances will give identical performance to any of its members in isolation. Bagging decision trees has proved very successful for many data mining problems [1, 4, 25].

Another method that uses different subsets of the training data with a single data mining method is the boosting approach [7]. Boosting is similar in overall structure to bagging, except that it keeps track of the performance of the data mining algorithm and concentrates on the instances that have not been correctly learned. Instead of choosing the training instances randomly using a uniform distribution, it chooses the training instances in such a manner as to favor those that have not been accurately learned. After several cycles, the prediction is performed by taking a weighted vote of the predictions of each classifier, with the weights being proportional to each classifier's accuracy on its training set.

AdaBoost is a practical version of the boosting approach [7]. There are two ways in which AdaBoost can use the instance weights to construct a new training set for the base data mining algorithm. In boosting by sampling, examples are drawn with replacement with probability proportional to their weights. The second method, boosting by weighting, can be used with base data mining algorithms that can accept a weighted training set directly; with such algorithms, the entire training set (with the associated weights) is given to the base algorithm. AdaBoost requires less instability than bagging, because it can make much larger changes in the training set.

MultiBoosting [21] is another method of the same category that can be considered as wagging committees formed by AdaBoost. Wagging is a variant of bagging: bagging uses re-sampling to obtain the datasets for training and producing a weak hypothesis, whereas wagging uses re-weighting of each training example, pursuing the effect of bagging in a different way. A number of experiments showed that MultiBoost achieves greater mean error reductions than either AdaBoost or bagging decision trees at both committee sizes that were investigated (10 and 100) [21].

Another approach to building ensembles of classifiers is to apply a variety of data mining algorithms to all of the training data and combine their predictions. When multiple classifiers are combined using a voting methodology, we expect to obtain good results, based on the belief that the majority of experts are more likely to be correct in their decision when they agree in their opinion [9].
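Bagging's resample-and-vote loop can be sketched in a few lines of Python. Everything below is illustrative: the 1-nearest-neighbour base learner, the toy one-feature dataset and the seed are stand-ins invented for this example, not the decision trees used in the paper.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw N examples with replacement from a training set of size N."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def train_1nn(sample):
    """Toy base learner: 1-nearest-neighbour on a single numeric feature."""
    def predict(x):
        return min(sample, key=lambda ex: abs(ex[0] - x))[1]
    return predict

def bagged_predict(models, x):
    """Plain bagging: each model casts one vote and the majority wins."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

rng = random.Random(42)
train = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.8, "b"), (0.9, "b")]
models = [train_1nn(bootstrap_sample(train, rng)) for _ in range(10)]
print(bagged_predict(models, 0.15))  # the "a" region wins the committee vote
print(bagged_predict(models, 0.85))  # the "b" region wins the committee vote
```

Note how the vote exploits instability: each bootstrap sample omits some examples, so the individual models disagree near the class boundary while the committee remains stable.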
Another method for combining classifiers, called Grading, trains a meta-level classifier for each base-level classifier [19]. The meta-level classifier predicts whether the base-level classifier is to be trusted (i.e., whether its prediction will be correct). The base-level attributes are also used as meta-level attributes, while the meta-level class values are + (correct) and − (incorrect). Only the base-level classifiers that are predicted to be correct are taken, and their predictions are combined by summing up the predicted probability distributions.

Stacked generalization [20], or Stacking, is another sophisticated approach for combining the predictions of different data mining algorithms. Stacking combines multiple classifiers to induce a higher-level classifier with improved performance. In detail, the original data set constitutes the level-zero data, and all the base classifiers run at this level. The level-one data are the outputs of the base classifiers. A data mining algorithm is then used to determine how the outputs of the base classifiers should be combined, using the level-one data as input. It has been shown that successful stacked generalization requires the use of output class distributions rather than class predictions [20].
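As a rough illustration of the level-zero/level-one idea behind Stacking, the sketch below builds level-one data from the class-probability outputs of two toy base classifiers and uses a crude accuracy-weighted combiner in place of a real meta-level learner. All rules, names and numbers here are invented for the example; they are not from the paper.

```python
# Level-zero: two simple rules emitting class-probability distributions over {"a","b"}.
def clf_threshold(x):           # believes small x means class "a"
    p_a = 1.0 if x < 0.5 else 0.2
    return {"a": p_a, "b": 1 - p_a}

def clf_parity(x):              # a weaker, noisier rule
    p_a = 0.6 if round(x * 10) % 2 == 0 else 0.4
    return {"a": p_a, "b": 1 - p_a}

train = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.8, "b"), (0.9, "b")]

# Level-one data: base-classifier output distributions, paired with the true class.
level1 = [((clf_threshold(x), clf_parity(x)), y) for x, y in train]

# A minimal level-one "learner": weight each base classifier by its accuracy
# on the level-one data (a crude stand-in for training a real meta-classifier).
def accuracy(idx):
    return sum(max(d[idx], key=d[idx].get) == y for d, y in level1) / len(level1)

w = [accuracy(0), accuracy(1)]

def stacked_predict(x):
    dists = (clf_threshold(x), clf_parity(x))
    score = {c: sum(w[i] * dists[i][c] for i in range(2)) for c in ("a", "b")}
    return max(score, key=score.get)
```

The combiner consumes class distributions, not hard predictions, in line with the observation from [20] quoted above.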

3 Presented Methodology

The combination of two methods that agree everywhere cannot lead to any accuracy improvement, no matter how ingenious a combination method is employed. Naïve Bayes (NB) [6] and C4.5 [15] differ considerably in their predictions: a number of comparisons has shown that selecting the better of NB and C4.5 according to 3-fold cross-validation gives better accuracy in most cases than bagging, boosting or multiboosting either NB or C4.5 [10].

Bagging uses a voting technique that is unable to take into account the heterogeneity of the instance space. When the majority of the base classifiers give a wrong prediction for a new instance, the majority vote results in a wrong prediction. The problem may consist in discarding base classifiers (by assigning them small weights) that are highly accurate in a restricted region of the instance space, because this accuracy is swamped by their inaccuracy outside that restricted area. It may also consist in using classifiers that are accurate in most of the space but still unnecessarily confuse the whole classification committee in some restricted areas of the space. To overcome this problem, we suggest bagged sum-rule voting using two data mining algorithms. When the sum rule is used, each voter has to give a confidence value for each candidate. In our algorithm, voters express the degree of their preference using the probabilities of the classifiers' predictions as confidence scores [20]. All confidence values are then added for each candidate, and the candidate with the highest sum wins the election. The algorithm is briefly described in Figure 1.
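The sum rule itself fits in a few lines; the class names and probability values below are made up for illustration. Note that with only two voters a plain majority vote could tie, whereas the sum rule is decided by the classifiers' confidence.

```python
def sum_rule(distributions):
    """Each voter reports a confidence (class probability) per candidate;
    confidences are added and the candidate with the largest sum wins."""
    totals = {}
    for dist in distributions:
        for cls, p in dist.items():
            totals[cls] = totals.get(cls, 0.0) + p
    return max(totals, key=totals.get)

# NB is fairly sure of "yes"; the tree slightly prefers "no".
nb_dist   = {"yes": 0.9, "no": 0.1}
tree_dist = {"yes": 0.4, "no": 0.6}
print(sum_rule([nb_dist, tree_dist]))  # "yes" (sums: 1.3 vs 0.7)
```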

MODEL GENERATION
Let n be the number of instances in the training data.
For each of t iterations (t = 10 in our experiments):
  • Sample n instances with replacement from the training data.
  • Build two classifiers (NB, DT) from the sample.
  • Apply sum voting of the two data mining algorithms.
  • Store the resulting model.

CLASSIFICATION
For each of the t models:
  Predict the class of the instance using the model.
Return the class that has been predicted most often.

Fig. 1. The proposed ensemble
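The pseudocode of Figure 1 can be sketched as runnable Python. The two base learners here are trivial stand-ins invented for the sketch (a class-prior model in place of NB and a one-level threshold split in place of C4.5); the actual method uses NB and C4.5.

```python
import random

def train_prior(sample):
    """Stand-in for NB: predicts class-prior probabilities, ignoring x."""
    labels = [y for _, y in sample]
    classes = set(labels)
    return lambda x: {c: labels.count(c) / len(labels) for c in classes}

def train_stump(sample):
    """Stand-in for the decision tree: one threshold split at the feature mean."""
    split = sum(x for x, _ in sample) / len(sample)
    left  = [y for x, y in sample if x <  split] or [y for _, y in sample]
    right = [y for x, y in sample if x >= split] or [y for _, y in sample]
    classes = {y for _, y in sample}
    def predict(x):
        side = left if x < split else right
        return {c: side.count(c) / len(side) for c in classes}
    return predict

def bagged_voting_fit(train, t=10, rng=None):
    """MODEL GENERATION: t bootstrap samples, an (NB, DT) pair per sample."""
    rng = rng or random.Random(1)
    n = len(train)
    models = []
    for _ in range(t):
        sample = [train[rng.randrange(n)] for _ in range(n)]
        models.append((train_prior(sample), train_stump(sample)))
    return models

def bagged_voting_predict(models, x):
    """CLASSIFICATION: sum rule inside each pair, then majority over pairs."""
    votes = {}
    for nb, dt in models:
        d_nb, d_dt = nb(x), dt(x)
        summed = {c: d_nb.get(c, 0) + d_dt.get(c, 0) for c in set(d_nb) | set(d_dt)}
        winner = max(summed, key=summed.get)
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)   # class predicted most often

data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.8, "b"), (0.9, "b")]
models = bagged_voting_fit(data, t=10)
```

With this toy data, `bagged_voting_predict(models, 0.15)` falls in the "a" region and `bagged_voting_predict(models, 0.85)` in the "b" region.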

The time complexity increases to twice that of bagging. This increase is, however, compensated for by the empirical fact that Bagged Voting provides better generalization performance with a much smaller ensemble size than bagging. In [12] it is shown that the generalization error, E, of an ensemble can be expressed as E = e − d, where e and d are the mean error and the diversity of the ensemble, respectively. This result implies that increasing ensemble diversity while maintaining the average error of the ensemble members should lead to a decrease in ensemble error. Bagged voting leads to larger diversity among the member classifiers, even though the individual classifiers may not achieve optimal generalization performance. If the increase in diversity (d) is larger than the increase in the average error (e) of the individual classifiers, the generalization error of the ensemble decreases. For this reason, the Bagged Voting scheme is useful in terms of ensemble diversity.
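A purely numerical illustration of the E = e − d trade-off; the error and diversity figures below are invented, not measured, and the decomposition is the one attributed to [12] above.

```python
# Hypothetical numbers: bagged voting raises the mean member error e slightly
# but raises diversity d more, so ensemble error E = e - d still falls.
e_bagging, d_bagging = 0.20, 0.05   # assumed plain-bagging ensemble
e_voting,  d_voting  = 0.22, 0.09   # assumed bagged voting: weaker but more diverse members

E_bagging = e_bagging - d_bagging   # about 0.15
E_voting  = e_voting  - d_voting    # about 0.13, i.e. lower ensemble error
print(E_bagging, E_voting)
```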


It has been observed that for bagging, an increase in committee size (number of sub-classifiers) usually leads to a decrease in prediction error, but the relative impact of each successive addition to a committee is ever diminishing; most of the effect of each technique is obtained by the first few committee members [1, 4, 9, 23]. For this reason, we used 10 sub-classifiers in the proposed algorithm.

It must also be mentioned that the proposed ensemble can easily be parallelized. The computations required to obtain the classifiers for each bootstrap sample are independent of each other, so we can assign tasks to each processor in a balanced manner; in the end, each processor has obtained a part of the Bagged Voting ensemble. If we use the master-slave parallel programming technique, the method starts with the master splitting the work to be done into small tasks and assigning them to the slaves (the C4.5 and NB classifiers). The master then iterates: whenever a slave returns a result (meaning it has finished its work), the master assigns it another task if there are still tasks to be executed. Once all the tasks have been carried out, the master process collects the results and orders the slaves to finish, since there are no more tasks to be carried out. This parallel execution of the presented ensemble can achieve almost linear speedup.
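The master-slave scheme can be sketched with a thread pool playing the master and its workers playing the slaves; the "training" function is a placeholder, and the worker count, seeds and task contents are assumptions of this example rather than details from the paper.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def train_pair(task):
    """One independent task: build the (NB, C4.5) pair on one bootstrap sample.
    The 'models' returned here are placeholders standing in for real training."""
    seed, data = task
    rng = random.Random(seed)
    sample = [data[rng.randrange(len(data))] for _ in range(len(data))]
    return ("nb_model", "dt_model", len(sample))

data = list(range(100))
tasks = [(seed, data) for seed in range(10)]   # t = 10 independent tasks

# The executor is the master: it hands a task to each free worker and keeps
# assigning tasks as workers finish, until none remain.
with ThreadPoolExecutor(max_workers=4) as master:
    ensemble = list(master.map(train_pair, tasks))

print(len(ensemble))   # 10 classifier pairs collected by the master
```

Because the bootstrap samples are independent, no communication between workers is needed during training, which is what makes the near-linear speedup claim plausible.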

4 Comparisons and Results

For the comparisons of our study, we used 30 well-known datasets, mainly from the UCI repository [2]. We have also used data from language morphological analysis (dimin) [3], agricultural domains (eucalyptus, white-clover) [13] and prediction of student dropout (student) [11]. In order to calculate the classifiers' accuracy, the whole training set was divided into ten mutually exclusive and equal-sized subsets, and for each subset the classifier was trained on the union of all of the other subsets. This cross-validation was then run 10 times for each algorithm, and the median value of the 10 cross-validations was calculated (10x10 cross-validation). It must be mentioned that we used the freely available source code for the algorithms by [22]. In the following tables, we denote with "vv" that the proposed ensemble (Bagged Voting) loses to the specific ensemble, that is, the specific algorithm performed statistically better than the proposed one according to a t-test with p
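The fold construction behind the 10x10 cross-validation described above can be sketched as follows; the dataset size of 95 and the seed are arbitrary choices for the example.

```python
import random

def ten_fold_indices(n, rng):
    """Split indices 0..n-1 into ten mutually exclusive, near equal-sized folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::10] for i in range(10)]

rng = random.Random(0)
folds = ten_fold_indices(95, rng)

# One cross-validation run: train on the union of nine folds, test on the rest.
# Reshuffling and repeating the whole procedure ten times gives the 10x10 design.
for k, test_fold in enumerate(folds):
    train_idx = [i for j, f in enumerate(folds) if j != k for i in f]
    # ...train a classifier on train_idx and evaluate it on test_fold here...

print([len(f) for f in folds])  # fold sizes differ by at most one
```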