Feature Selection for Ensembles of Simple Bayesian Classifiers

Alexey Tsymbal¹, Seppo Puuronen¹, David Patterson²

¹ University of Jyväskylä, P.O. Box 35, FIN-40351 Jyväskylä, Finland
{Alexey, Sepi}@jytko.jyu.fi
² Northern Ireland Knowledge Engineering Laboratory, University of Ulster, U.K.
[email protected]

Abstract. A popular method for creating an accurate classifier from a set of training data is to train several classifiers and then to combine their predictions. Ensembles of simple Bayesian classifiers have traditionally not been a focus of research. However, the simple Bayesian classifier has much broader applicability than previously thought. Besides its high classification accuracy, it also has advantages in terms of simplicity, learning speed, classification speed, storage space, and incrementality. One way to generate an ensemble of simple Bayesian classifiers is to use different feature subsets, as in the random subspace method. In this paper we present a technique for building ensembles of simple Bayesian classifiers in random subspaces. We also consider a hill-climbing-based refinement cycle, which improves the accuracy and diversity of the base classifiers. We conduct a number of experiments on a collection of real-world and synthetic data sets. In many cases the ensembles of simple Bayesian classifiers have significantly higher accuracy than the single "global" simple Bayesian classifier. We consider several methods for the integration of simple Bayesian classifiers and show that dynamic integration utilizes ensemble diversity better than static integration.

1 Introduction

A popular method for creating an accurate classifier from a set of training data is to train several classifiers and then to combine their predictions [5]. Previous theoretical and empirical research has shown that an ensemble is often more accurate than any single classifier in it [1,5,8,11]. Ensembles of simple Bayesian classifiers have traditionally not been a focus of research. One reason is that simple Bayes relies on an assumption that is rarely valid in practical learning problems: that the features are independent of each other given the predicted value. Another reason is that simple Bayes is an extremely stable learning algorithm, and many ensemble techniques are primarily variance-reduction techniques, so they gain little from combining such stable classifiers [1]. However, it has recently been shown that simple Bayes can be optimal even when the independence assumption is violated by a wide margin [6]. Moreover, simple Bayes can be used effectively in ensemble techniques that also perform bias reduction, such as boosting; for example, Elkan's application of boosted simple Bayes won first place out of 45 entries in the KDD'97 data mining competition [7]. In addition, when simple Bayes is applied to sub-problems of lower dimensionality, as in the random subspace method [9], the bias of the Bayesian probability estimates caused by the independence assumption becomes smaller. One successful application of ensembles of simple Bayesian classifiers built on different feature subsets to word sense disambiguation was presented in [13].

In this paper we present a technique for building ensembles of simple Bayesian classifiers in random subspaces. We also consider a hill-climbing-based refinement cycle, which improves the accuracy and diversity of the base classifiers. We conduct a number of experiments on a collection of real-world and synthetic data sets. We compare two commonly used static integration techniques, static selection and weighted voting [5], with three recently proposed techniques for dynamic integration of classifiers: dynamic selection, dynamic voting, and dynamic voting with selection [14,17]. Dynamic integration is based on estimating the local accuracies of the base classifiers. We show that dynamic integration utilizes ensemble diversity better than static integration.

The paper is organized as follows. In Section 2 the general problem of constructing an ensemble of classifiers is considered. In Section 3 we review the random subspace method. In Section 4 we present our algorithm for feature selection in ensembles of simple Bayesian classifiers. In Section 5 experiments with this algorithm are presented. We conclude in Section 6 with a summary and topics for further research.

2 An Ensemble of Classifiers

Both theoretical and empirical research has shown that an effective ensemble should consist of models that not only have high classification accuracy but also make their errors in different parts of the input space [1,3,8,11]; obviously, combining several identical classifiers produces no gain. Brodley and Lane [3] show that the main objective when generating the base classifiers is to maximize the coverage of the data, which is the percentage of the data that at least one base classifier can classify correctly. Achieving coverage greater than the accuracy of the best base classifier requires diversity among the base classifiers, and several researchers have presented theoretical evidence supporting this claim [8,11]. In this paper, to measure the disagreement between a base classifier i and the whole ensemble, we calculate its diversity Div_i on the test instances as the average difference in predictions over all pairs of classifiers containing i:

$$\mathrm{Div}_i \;=\; \frac{\sum_{j=1}^{M} \sum_{k=1,\, k \neq i}^{S} \mathrm{Dif}\bigl(h_i(x_j),\, h_k(x_j)\bigr)}{M\,(S-1)}, \qquad (1)$$

where S is the number of base classifiers, h_i(x_j) is the classification of instance x_j by classifier h_i, Dif(a,b) is zero if the classifications a and b are the same and one if they are different, and M is the number of instances in the test set. The total diversity of the ensemble can be defined as the average diversity of its members.

The second important aspect of creating an effective ensemble is the choice of the method for integrating the predictions of the base classifiers [3,5]. Brodley and Lane [3] have shown that increasing the coverage of an ensemble through diversity alone is not enough to ensure increased prediction accuracy: if the integration method does not utilize the coverage, then no benefit arises from integrating multiple models. Thus, diversity and coverage of an ensemble are not a sufficient condition for ensemble accuracy; a good integration procedure that utilizes the diversity of the base classifiers is also needed. The most commonly used integration techniques are voting-based schemes for discrete predictors [1] and simple and weighted averaging for numeric and a posteriori probability predictors [11]. These techniques are simple and well studied both theoretically [8,11] and experimentally [1]. However, voting and averaging ensembles do not take into account the local expertise of the base classifiers. Classification accuracy can be significantly improved with an integration technique that is capable of identifying the regions of expertise of the base classifiers (e.g., dynamic integration) [14,17].
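To make the measure concrete, the following Python sketch (ours, not part of the original study) computes Div_i from equation (1) for every ensemble member, given an S x M matrix of test-set predictions; the variable name `predictions` is an assumption made for illustration.

```python
import numpy as np

def diversity(predictions: np.ndarray) -> np.ndarray:
    """Compute Div_i from Eq. (1) for every base classifier.

    predictions: S x M array; predictions[i, j] is classifier i's
    label for test instance x_j.  Returns a length-S vector of
    diversities.
    """
    S, M = predictions.shape
    div = np.empty(S)
    for i in range(S):
        # Dif(h_i(x_j), h_k(x_j)) summed over all j and all k != i;
        # the k == i row never disagrees with itself, so it adds zero.
        disagreements = (predictions != predictions[i]).sum()
        div[i] = disagreements / (M * (S - 1))
    return div
```

The total diversity of the ensemble, as defined above, is then simply `diversity(predictions).mean()`.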

3 The Random Subspace Method

One effective approach for generating a set of base classifiers is ensemble feature selection [12]. Ensemble feature selection consists in finding a set of feature subsets that are used to generate the base classifiers of an ensemble with a single learning algorithm. Ho [9] has shown that simple random selection of feature subsets may be an effective technique for ensemble feature selection; this technique is called the random subspace method (RSM) [9]. In the RSM, one randomly selects a subset of N* < N features from the original N-dimensional feature set, and each base classifier is built on one such randomly chosen subspace.

4 The EFS_SBC Algorithm

Our algorithm for ensemble feature selection with simple Bayesian classifiers, EFS_SBC, starts from feature subsets FS[i] generated with the RSM and refines each of them with a hill-climbing cycle. A change to a subset is accepted only if it improves the fitness of the corresponding base classifier, defined as the sum acc + α·div of its accuracy and its diversity (1) weighted by the diversity coefficient α. An outline of the algorithm is given in Figure 1.

_________________________________________________________
algorithm EFS_SBC
| for i = 1 to S (each base classifier)
| | FS[i] := random feature subset (RSM); calculate acc and div
| | repeat (hill-climbing refinement of FS[i])
| | | for each of the N features
| | | | include the feature in / exclude it from FS[i]
| | | | if new fitness > (acc + α·div) then accept FS[i]
| | | | | and update FS, acc, and div
| | | | else restore previous feature subset FS[i]
| | | end for N
| | until no_changes
| end for S
end algorithm EFS_SBC
_________________________________________________________

Fig. 1. Outline of the EFS_SBC algorithm for ensemble feature selection with simple Bayes
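As a concrete illustration of the refinement cycle, here is a minimal Python sketch of RSM initialization plus hill-climbing on the fitness acc + α·div. It is our own sketch, not the original MLC++ implementation; the use of Gaussian naive Bayes (the paper discretizes numeric features instead), the fixed validation split, and the helper names (`refine_subset`, `efs_sbc`, `n_init`) are all assumptions made for brevity.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def refine_subset(mask, others, alpha, X_tr, y_tr, X_val, y_val):
    """Hill-climb one boolean feature mask on fitness = acc + alpha*div.

    others: validation predictions of the already refined classifiers;
    diversity is the mean disagreement with them (zero for the first
    classifier, matching the paper's convention).
    """
    def fitness(m):
        clf = GaussianNB().fit(X_tr[:, m], y_tr)
        pred = clf.predict(X_val[:, m])
        acc = np.mean(pred == y_val)
        div = np.mean([np.mean(pred != o) for o in others]) if others else 0.0
        return acc + alpha * div

    best = fitness(mask)
    changed = True
    while changed:                        # "repeat ... until no_changes"
        changed = False
        for f in range(mask.size):        # try flipping every feature
            mask[f] = not mask[f]
            new = fitness(mask) if mask.any() else -np.inf
            if new > best:                # accept only strict improvements
                best, changed = new, True
            else:
                mask[f] = not mask[f]     # restore the previous subset
    return mask

def efs_sbc(S, alpha, n_init, X_tr, y_tr, X_val, y_val, rng):
    """Build S refined feature subsets, starting from random subspaces."""
    n = X_tr.shape[1]
    masks, others = [], []
    for _ in range(S):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=n_init, replace=False)] = True  # RSM init
        mask = refine_subset(mask, others, alpha, X_tr, y_tr, X_val, y_val)
        clf = GaussianNB().fit(X_tr[:, mask], y_tr)
        others.append(clf.predict(X_val[:, mask]))   # for later diversity
        masks.append(mask)
    return masks
```

The outer loop refines the subsets in turn and appends each refined classifier's validation predictions to `others`, which matches the paper's rule that diversity is measured only against already refined ensemble members.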

In the next section we present experiments with EFS_SBC. The diversity is calculated using (1), but only over the already refined base classifiers. Thus, the diversity for the first refined feature subset is always zero, and its fitness represents accuracy alone.

5 Experiments

In this section experiments with our algorithm EFS_SBC are presented. First the experimental setting is described, and then the results of the experiments are presented. The experiments are conducted on 21 data sets taken from the UCI machine learning repository [2]. These data sets include real-world and synthetic problems, vary in their characteristics, and have been investigated by previous researchers. For each data set, 30 test runs of EFS_SBC are made. In each run the data set is first split into a training set, a validation set, and a test set by stratified random sampling, so that the class distributions of the instances in each set are approximately the same as in the initial data set. Each time 60 percent of the instances are placed in the training set, and the remaining 40 percent are divided into two approximately equal sets: the validation set (VS) and the test set (TS). We experimented with six different values of the diversity coefficient α: 0, 0.25, 0.5, 1, 2, and 4. The ensemble size S was set to 25; it has been shown that for many ensemble types the biggest gain in accuracy is already achieved with this number of base classifiers [1]. At each run of the algorithm, we collect accuracies for five types of integration of classifiers [14,17]: Static Selection (SS), Weighted Voting (WV), Dynamic Selection (DS), Dynamic Voting (DV), and Dynamic Voting with Selection (DVS). In the dynamic integration strategies DS, DV, and DVS [14,17], the number of nearest neighbors for the local accuracy estimates was pre-selected for each data set from six values: 1, 3, 7, 15, 31, and 63. The test environment was implemented within the MLC++ framework (the machine learning library in C++) [10]. For the simple Bayesian classifier, the numeric features were discretized into ten equal-length intervals (or one per observed value, whichever was less), as was done in [6]. Although this approach has been found to be slightly less accurate than more sophisticated ones, it has the advantage of simplicity and is sufficient for comparing different ensembles of simple Bayesian classifiers. A multiplicative factor of 1 was used for the Laplace correction in simple Bayes [6].
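As an illustration of this experimental setup, the following Python sketch (ours, not the original MLC++ code) performs the stratified 60/20/20 split and the equal-length discretization into at most ten intervals; the scikit-learn utility and the function names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_60_20_20(X, y, seed):
    """Stratified split into 60% train, 20% validation, 20% test."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=seed)
    return X_tr, y_tr, X_val, y_val, X_te, y_te

def discretize(col, n_bins=10):
    """Equal-length discretization of one numeric feature, with at most
    one interval per observed value, as in [6]."""
    n_bins = min(n_bins, np.unique(col).size)
    edges = np.linspace(col.min(), col.max(), n_bins + 1)
    return np.clip(np.digitize(col, edges[1:-1]), 0, n_bins - 1)
```

With this discretization, the Laplace correction with multiplicative factor 1 amounts to estimating each conditional probability as (count + 1)/(n_c + v), where n_c is the number of training instances of class c and v is the number of values of the feature.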


Fig. 3. Main average ensemble characteristics for the random RSM ensembles and for different α

In Figure 3, the average accuracies of the base classifiers (Aver), the average ensemble coverages (Cover), and the average accuracies of the static (SS and WV: Static) and dynamic (DS, DV, and DVS: Dynamic) integration techniques are presented for the initial RSM ensembles and for the refined ensembles with different α. One can see that the initial RSM ensembles already show very good results, because the lack of accuracy of the ensemble members (0.716) is compensated for by the ensemble coverage (0.945). This supports the results presented in [9,16]. As could be expected, with the growth of alpha the ensemble coverage grows (from 0.891 to 0.959) and the average accuracy of the base classifiers drops (from 0.791 to 0.659). One important conclusion that can be drawn is that dynamic integration utilizes the ensemble coverage better. As was shown in [15], the dynamic integration of base classifiers built on different feature subsets implicitly performs local feature selection (DS) or local feature weighting (DV and DVS). When the average base accuracy drops, the accuracy of static integration drops significantly as well, but the accuracy of dynamic integration even grows in many cases, and the difference between the static and dynamic approaches grows from 0.008 for α=0 to 0.045 for α=4. The best accuracy on average, 0.827, is achieved with dynamic integration when α=2. However, the optimal α differs across data sets.

In Table 1 the experimental results of the iterative refinement on the 21 data sets are presented. The table includes the names of the data sets, the best α (alpha), the average accuracies of the base classifiers (Aver), the accuracies of the five integration techniques (SS, WV, DS, DV, DVS), the accuracies of simple Bayes on the whole feature set (Bayes), the average relative numbers of features selected (feat), and the improvements of the refinement cycle over the initial random ensembles (impr). The best accuracies among the integration techniques are given in italic, and results significantly better than those of single simple Bayes are also given in bold (statistical significance is checked with the one-tailed Student t-test at the 0.95 level of significance). Statistically significant improvements over the initial random ensembles are given in bold in the last column.

Table 1. Results of the iterative refinement of feature subsets

Data set     alpha   Aver    SS      WV      DS      DV      DVS     Bayes   feat   impr
Balance      2       0.719   0.893   0.899   0.901   0.901   0.903   0.900   0.50    0.002
Breast       1       0.725   0.729   0.744   0.739   0.752   0.752   0.742   0.50    0.001
Car          2       0.779   0.836   0.819   0.903   0.855   0.893   0.846   0.60    0.016
Diabetes     1       0.723   0.761   0.755   0.757   0.755   0.756   0.756   0.48    0.000
Glass        4       0.475   0.574   0.608   0.609   0.679   0.679   0.586   0.33    0.055
Heart        1       0.777   0.815   0.832   0.810   0.830   0.833   0.832   0.52    0.002
Ionosphere   1       0.894   0.895   0.909   0.899   0.914   0.914   0.901   0.55    0.004
Iris         2       0.800   0.913   0.889   0.931   0.920   0.922   0.891   0.47    0.014
Led          1       0.618   0.734   0.746   0.757   0.748   0.748   0.757   0.72    0.004
Led17        0.25    0.662   0.670   0.700   0.670   0.702   0.693   0.648   0.60    0.052
Liver        1       0.588   0.614   0.620   0.613   0.633   0.622   0.623   0.45    0.004
Lymph        2       0.732   0.822   0.844   0.818   0.852   0.859   0.846   0.46    0.012
Monk1        4       0.569   0.756   0.663   0.925   0.811   0.879   0.756   0.43    0.008
Monk2        0.25    0.664   0.662   0.667   0.664   0.667   0.664   0.625   0.51    0.001
Monk3        2       0.797   0.973   0.973   0.985   0.985   0.985   0.973   0.58   -0.002
Soybean      1       0.949   0.993   1.000   0.993   1.000   1.000   1.000   0.52    0.000
Thyroid      2       0.878   0.955   0.940   0.967   0.961   0.961   0.960   0.54   -0.006
Tic          4       0.676   0.690   0.718   0.947   0.820   0.935   0.707   0.40    0.056
Vehicle      4       0.423   0.569   0.594   0.657   0.679   0.688   0.592   0.19    0.057
Vote         0.5     0.923   0.941   0.935   0.951   0.935   0.946   0.898   0.49    0.009
Zoo          2       0.773   0.927   0.948   0.935   0.953   0.950   0.925   0.49    0.011
Average      1.810   0.721   0.796   0.800   0.830   0.826   0.837   0.798   0.49    0.014
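The significance testing used in Table 1 (a one-tailed paired Student t-test at the 0.95 level over the 30 runs) can be reproduced along the following lines; this is our sketch, and the array names are placeholders.

```python
from scipy import stats

def significantly_better(acc_a, acc_b, level=0.95):
    """One-tailed paired t-test: is method A better than method B?

    acc_a, acc_b: per-run accuracies (e.g. 30 values each) of the two
    methods obtained on the same random splits.
    """
    t, p_two_sided = stats.ttest_rel(acc_a, acc_b)
    return t > 0 and p_two_sided / 2 < 1 - level
```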

From Table 1 one can see that on 14 of the 21 data sets the ensembles are better than single Bayes with statistical significance, and in all these cases a dynamic approach is the best. For example, on MONK-1, DS improves on the single simple Bayesian classifier by 17%. The poor performance of simple and boosted Bayes on the first two Monk's problems is discussed in [7]. The initial random ensembles are improved with statistical significance on only 10 data sets; the best improvement is 0.057, on the Vehicle data set. This again demonstrates the good performance of the plain random subspace method. The best integration technique on average is DVS. It is quite stable, as it combines the power of dynamic selection and dynamic voting.
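For readers unfamiliar with the dynamic strategies compared above, the following Python sketch shows the core of dynamic selection (DS) as described in [14,17]: for each test instance, estimate each base classifier's local accuracy over its k nearest neighbors in the validation set, and let the locally most accurate classifier predict. The data structures and the Euclidean metric are our assumptions; DV and DVS would instead weight the members' votes by the same local accuracy estimates.

```python
import numpy as np

def dynamic_selection(x, classifiers, masks, X_val, y_val, k=7):
    """Predict for one instance with the locally most accurate member.

    classifiers[i] was trained on the boolean feature mask masks[i];
    local accuracy is estimated on the k nearest validation instances.
    """
    # indices of the k nearest neighbours of x in the validation set
    nn = np.argsort(np.linalg.norm(X_val - x, axis=1))[:k]
    local_acc = [
        np.mean(clf.predict(X_val[nn][:, m]) == y_val[nn])
        for clf, m in zip(classifiers, masks)
    ]
    best = int(np.argmax(local_acc))      # dynamic selection (DS)
    m = masks[best]
    return classifiers[best].predict(x[m].reshape(1, -1))[0]
```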

6 Conclusion

One way to construct an ensemble of diverse classifiers is to use feature subsets generated by the random subspace method. Ensembles of this type can produce very good results, because the lack of accuracy of the ensemble members is compensated for by their diversity. However, generating a set of diverse base classifiers with good coverage is not enough to ensure increased prediction accuracy: if the integration method does not utilize the coverage, then no benefit arises from integrating multiple models.

In this paper we presented an algorithm for ensemble feature selection with simple Bayesian classifiers. We considered a hill-climbing-based refinement cycle, which improved the accuracy and diversity of the base classifiers built with the random subspace method. We conducted a number of experiments on a collection of data sets. In many cases the ensembles of simple Bayesian classifiers had higher accuracy than the single "global" simple Bayesian classifier. We compared two commonly used static integration techniques with three recently proposed techniques for dynamic integration of classifiers, and showed that dynamic integration utilizes the diversity of the base classifiers better.

In future research it would be interesting to compare the performance of the presented algorithm with the genetic-algorithm-based approach of Opitz [12]. Presumably, the power of genetic algorithms will not give much gain in this case, as the random ensembles already perform quite well, and the genetic algorithm is more computationally expensive. Another interesting direction for future research is a comparison of the presented algorithm with boosted simple Bayes.

Acknowledgments: This research is supported by the COMAS Graduate School of the University of Jyväskylä. We would like to thank the UCI machine learning repository of databases, domain theories and data generators for the data sets, and the machine learning library in C++ (MLC++) for the source code used in this study. We are thankful to the anonymous referees for their valuable comments and constructive criticism.

References

1. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning, Vol. 36, Nos. 1-2 (1999) 105-139.
2. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Dept. of Information and Computer Science, University of California, Irvine, CA (1998).
3. Brodley, C., Lane, T.: Creating and exploiting coverage and diversity. In: Proc. AAAI-96 Workshop on Integrating Multiple Learned Models (1996) 8-14.
4. Cunningham, P.: Diversity versus quality in classification ensembles based on feature selection. Tech. Report TCD-CS-2000-02, Dept. of Computer Science, Trinity College Dublin, Ireland (2000).
5. Dietterich, T.G.: Ensemble learning methods. In: M.A. Arbib (ed.), Handbook of Brain Theory and Neural Networks, 2nd ed., MIT Press (2001).
6. Domingos, P., Pazzani, M.: On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, Vol. 29, Nos. 2-3 (1997) 103-130.
7. Elkan, C.: Boosting and naïve Bayesian learning. Tech. Report CS97-557, Dept. of Computer Science and Engineering, University of California, San Diego, USA (1997).
8. Hansen, L., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12 (1990) 993-1001.
9. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8 (1998) 832-844.
10. Kohavi, R., Sommerfield, D., Dougherty, J.: Data mining using MLC++: a machine learning library in C++. In: Tools with Artificial Intelligence, IEEE CS Press (1996) 234-245.
11. Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active learning. In: D. Touretzky, T. Leen (eds.), Advances in Neural Information Processing Systems, Vol. 7, MIT Press, Cambridge, MA (1995) 231-238.
12. Opitz, D.: Feature selection for ensembles. In: Proc. 16th National Conf. on Artificial Intelligence, AAAI Press (1999) 379-384.
13. Pedersen, T.: A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In: Proc. 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, Seattle, WA (2000) 63-69.
14. Puuronen, S., Terziyan, V., Tsymbal, A.: A dynamic integration algorithm for an ensemble of classifiers. In: Z.W. Ras, A. Skowron (eds.), Foundations of Intelligent Systems: ISMIS'99, Lecture Notes in AI, Vol. 1609, Springer-Verlag, Warsaw (1999) 592-600.
15. Puuronen, S., Tsymbal, A.: Local feature selection with dynamic integration of classifiers. Fundamenta Informaticae, Special Issue "Intelligent Information Systems", Vol. 47, Nos. 1-2, IOS Press (2001) 91-117.
16. Skurichina, M., Duin, R.P.W.: Bagging and the random subspace method for redundant feature spaces. In: J. Kittler, F. Roli (eds.), Proc. 2nd Int. Workshop on Multiple Classifier Systems MCS 2001, Cambridge, UK (2001) 1-10.
17. Tsymbal, A., Puuronen, S., Skrypnyk, I.: Ensemble feature selection with dynamic integration of classifiers. In: Proc. Int. ICSC Congress on Computational Intelligence Methods and Applications CIMA'2001, Bangor, Wales, UK (2001).
