2013 5th Conference on Information and Knowledge Technology (IKT)
A New Ensemble Classifier Creation Method by Creating New Training Set for Each Base Classifier

Jalil Ghavidel, Sasan Yazdani, Morteza Analoui
School of Computer Engineering, Iran University of Science & Technology, Tehran, Iran
[email protected], [email protected], [email protected]

Abstract—Base classifiers' classification error and diversity are key factors in the performance of ensemble methods, and there is usually a trade-off between them: decreasing the classification error of the base classifiers usually makes them less diverse, while increasing diversity results in less accurate base classifiers. This paper proposes a new ensemble classifier generation method which aims to create base classifiers that are both more diverse and more accurate. In this approach, the training data for each base classifier is built by taking a bootstrap sample of the original training set and then manipulating a set of arbitrary attributes of each pattern. We evaluated the proposed ensemble on 15 UCI data sets and were able to outperform Bagging, Boosting and Rotation Forest. Moreover, the Wilcoxon signed rank test confirms this claim and shows that the proposed method is significantly better than the other three methods on these data sets.

Keywords—Ensemble Classifiers; Adaboost; Bagging; Boosting; Rotation Forest

I. INTRODUCTION

In recent years, various ensemble creation methods have been proposed, making ensembles a hallmark of machine learning and pattern recognition [1], [2], [3], [4]. Ensemble methods create base classifiers and combine their outputs, usually by voting, to obtain better classification accuracy. Ensembles try to achieve better classification results by employing diverse classifiers. Diversity can be injected into an ensemble by using different types of classifiers, by manipulating the base learning algorithm, or by using different training samples to train each classifier [5]. There is an inverse relationship between the classification accuracy and the diversity of the base classifiers in an ensemble, i.e. decreasing the classification error of the base classifiers usually makes them less diverse and vice versa. An optimal ensemble must therefore compromise and find the best operating point for this trade-off. Bagging [4], Boosting [6] and Rotation Forest [7] are three of the most famous ensemble creation methods.

Bagging randomly selects a number of patterns from the original training set with replacement. The new training set has the same number of patterns as the original one, with some patterns repeated and some unused. This new training set is called a "bootstrap replicate" of the original training set. After training, the ensemble classifies unknown samples by majority vote; many forms of majority voting have been proposed [5]. Given that the data distribution in a bootstrap is very close to that of the original data set, Bagging's base classifiers are relatively accurate. The slight difference between the training samples of the classifiers is the only factor encouraging diversity in this method. Consequently, Bagging uses an unstable learner to produce base classifiers [8]; a learner is called unstable if small changes in the training set lead to significant changes in the learned classifier. Even though Bagging uses unstable base classifiers, the diversity created in the ensemble is not adequate. Therefore, in order to increase Bagging's accuracy, many classifiers must be employed to produce the required diversity [7]. Random Forest [9] and Wagging [10] are two of the most popular variants of the original Bagging ensemble.
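As a point of reference for the methods discussed in this paper, the following is a minimal sketch of Bagging-style training on bootstrap replicates combined by majority vote. The helper names, the use of scikit-learn decision trees and the assumption of integer-coded class labels are our own illustrative choices, not part of the original algorithm descriptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagging(X, y, n_classifiers=10, random_state=0):
    """Train an ensemble on bootstrap replicates of (X, y)."""
    rng = np.random.default_rng(random_state)
    ensemble = []
    n = len(X)
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def majority_vote(ensemble, X):
    """Combine member predictions by plain majority vote
    (labels assumed to be non-negative integers)."""
    votes = np.stack([clf.predict(X) for clf in ensemble])   # shape (L, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)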
Boosting [6] is another ensemble of classifiers that has gained a lot of attention in recent years, and various versions of it have been proposed [6], [11], [12], [13], [14], [15]. The most famous version of Boosting is Adaboost. This method combines weak classifiers to make a strong one; a weak classifier is a classifier that performs only slightly better than random guessing, i.e. a classifier whose accuracy is slightly above 50% (for a two-class problem). The main idea is to give more attention to samples which are harder to classify correctly. Training samples for the classifiers used in Adaboost are chosen based on their assigned weights. Initially, all training samples are assigned equal weights. After each iteration, correctly classified patterns are assigned smaller weights while the weights of misclassified samples are increased, so that they have a higher chance of being included in the next classifier's training set. Furthermore, each classifier in Adaboost has a corresponding weight; more accurate classifiers are assigned higher weights, making them more prominent in the vote. Boosting builds classifiers in such a way that they complement each other, so that if a pattern is misclassified by one base classifier there is a good chance that other ensemble members classify it correctly. It has been shown [16], [17] that Boosting can approximate large-margin classifiers such as SVM (Support Vector Machine). Despite its good performance, Adaboost has some drawbacks. When the number of training samples is insufficient, Adaboost usually does not perform well [18]. Moreover, since Adaboost focuses on misclassified patterns, incorrect pattern labels in the training set can degrade its final accuracy [19].
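The weight-update step just described can be sketched as follows. This is a rough illustration of an AdaBoost.M1-style reweighting with our own variable names, not a full reproduction of the algorithm in [11].

import numpy as np

def adaboost_reweight(w, y_true, y_pred):
    """One AdaBoost.M1-style update: down-weight correctly classified
    samples so that mistakes gain relative importance."""
    miss = (y_pred != y_true)
    eps = np.sum(w[miss]) / np.sum(w)            # weighted error of the new classifier
    eps = np.clip(eps, 1e-10, 0.5 - 1e-10)       # guard against degenerate values
    beta = eps / (1.0 - eps)
    alpha = np.log(1.0 / beta)                   # voting weight of this classifier
    w = w * np.where(miss, 1.0, beta)            # shrink weights of correct samples
    return w / w.sum(), alpha                    # renormalized weights, classifier weight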
Rotation Forest [7] is another high-performance ensemble. This method tries to create base classifiers that are both diverse and accurate. A decision tree learner is sensitive to rotation of the feature axes, and Rotation Forest takes advantage of this fact to create diversity among ensemble members. The goal is to produce base classifiers which are more accurate than the base classifiers in Adaboost and more diverse than Bagging's base classifiers. To achieve this goal, the feature set is randomly split into a number of subsets. For each subset, some classes are randomly chosen and all of their samples are discarded; afterwards, Principal Component Analysis (PCA) is applied to the remaining samples of that subset. All principal components are then put together to create a matrix named the "rotation matrix", which is used to rotate the feature axes. The whole data set is transformed into the new space and a classifier is trained on this transformed data set. Finally, majority voting is used to combine the classifiers' outputs. Rotation Forest thus trains each base classifier on the whole (transformed) data set, which yields relatively high accuracy, and relies on axis rotation as its source of diversity. RotBoost [20] is a derivative of Rotation Forest that combines Rotation Forest and Adaboost to reach higher performance.
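A minimal sketch of how such a rotation matrix could be assembled from PCA fitted on random feature subsets is given below. The helper name and subset count are ours, and the class-elimination and bootstrap steps of the full algorithm [7] are omitted for brevity.

import numpy as np
from sklearn.decomposition import PCA

def build_rotation_matrix(X, n_subsets=3, random_state=0):
    """Assemble a block-diagonal rotation matrix from PCA fitted on
    disjoint random feature subsets (simplified Rotation Forest step)."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for cols in subsets:
        pca = PCA().fit(X[:, cols])              # principal axes of this subset
        R[np.ix_(cols, cols)] = pca.components_.T
    return R                                     # transform data with X @ R
                                                 # (PCA mean-centering ignored here)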
The base classifiers in Bagging and Rotation Forest are independent, i.e. they can be learned simultaneously. Conversely, in Adaboost the base classifiers are dependent and must be learned one after another. In this paper, we propose a new dependent method for creating ensemble classifiers. To realize diversity, this method produces a new training set by manipulating bootstrap samples of the original training set: bootstrap samples are altered by adding random values, scaled relative to the feature averages, to a set of arbitrary attributes of every pattern. To achieve high accuracy, the method keeps only those base classifiers that do not have a negative impact on the ensemble's training accuracy.

The rest of this paper is organized as follows: Section II describes the proposed method. In Section III we compare the proposed method with Bagging, Boosting and Rotation Forest and present experimental results on 15 UCI repository data sets. Finally, Section IV concludes the paper and outlines future work.

II. PROPOSED METHOD
Let x = [x1, x2, ..., xn]^T represent a training pattern with n features in the training set X consisting of N training samples, and let F = {f1, f2, ..., fn} denote the feature set. To create the training set for a base classifier, the following steps are carried out (a code sketch follows the list):
• Draw a bootstrap sample X' of the original training set.
• Randomly select P percent of F's members; the selected members form a new set F'.
• For each feature fi in F', calculate the average value of fi over X' and denote it by mi.
• For each pattern in X' and each feature fi in F', choose a random value from [−c·mi, c·mi] and add it to the corresponding feature value, where c is a parameter whose optimal value can be found by cross-validation.
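The following is a minimal sketch of this training-set construction in NumPy (the original implementation was in Matlab). The function and variable names are ours; P=0.7 matches the 70% used in the experiments of Section III, while the default for c is arbitrary since the paper tunes it by cross-validation.

import numpy as np

def make_perturbed_bootstrap(X, y, P=0.7, c=0.5, rng=None):
    """Bootstrap the training set, then add uniform noise in
    [-c*m_i, c*m_i] to a random P-fraction of the features,
    where m_i is the feature's average over the bootstrap."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    idx = rng.integers(0, n, size=n)                      # bootstrap sample X'
    Xb, yb = X[idx].astype(float), y[idx]
    chosen = rng.choice(d, size=max(1, int(round(P * d))), replace=False)   # F'
    for i in chosen:
        m_i = Xb[:, i].mean()                             # feature average over X'
        Xb[:, i] += rng.uniform(-1.0, 1.0, size=n) * (c * m_i)
    return Xb, yb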
In each iteration, a new classifier is trained on such a new training set and added to the ensemble. Afterwards, the ensemble error is calculated on the original training set. The new base classifier is kept as a member if it does not increase the ensemble's error; otherwise it is discarded. In the first iteration, the ensemble error is assumed to be 100%. Given that many of the produced base classifiers can be discarded, a limit is set on the number of iterations of the algorithm: the proposed method keeps creating base classifiers until the ensemble size reaches a predefined limit or the maximum number of iterations is reached. Majority vote is used to produce the ensemble decision. Fig. 1 shows the pseudo code for the proposed algorithm. Since the feature average mi is calculated on a bootstrap of the original training set, each iteration has its own value, causing a different perturbation of the training set. Thus, diversity is achieved in two ways: first, by bootstrapping the original training set and, second, by adding random values to some of the feature values. Randomness is a good source of diversity, but diversity is not always realized by randomness; therefore, there should be a mechanism to balance the diversity caused by randomness against the overall accuracy of the ensemble. The proposed method employs the aforementioned selection process [3] to meet this goal. The selection process implicitly chooses base classifiers that can work together and cover each other's shortcomings.

Figure 1. Pseudo code for the proposed method
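Since the pseudo code of Fig. 1 is not reproduced here, the sketch below shows one way the build-and-select loop just described could be implemented, reusing the make_perturbed_bootstrap and majority_vote helpers sketched earlier. The stopping parameters follow the experimental settings of Section III; the scikit-learn decision tree is our stand-in for the paper's Matlab classregtree learner.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_ensemble(X, y, max_size=10, max_iter=50, P=0.7, c=0.5, seed=0):
    """Keep a candidate tree only if it does not increase the ensemble's
    error on the original training set (error starts at 100%)."""
    rng = np.random.default_rng(seed)
    ensemble, best_err = [], 1.0
    for _ in range(max_iter):
        if len(ensemble) >= max_size:
            break
        Xb, yb = make_perturbed_bootstrap(X, y, P=P, c=c, rng=rng)
        candidate = DecisionTreeClassifier().fit(Xb, yb)
        trial = ensemble + [candidate]
        err = np.mean(majority_vote(trial, X) != y)   # error on the original data
        if err <= best_err:
            ensemble, best_err = trial, err           # keep the candidate
        # otherwise the candidate is discarded
    return ensemble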
III. EXPERIMENTAL RESULTS

In our experiments, we employed a decision tree as the base learner. All methods were implemented in Matlab using classregtree. The performance of the proposed method is examined on 15 UCI [21] classification data sets and compared with Bagging, Adaboost.M1 and Rotation Forest. Table I shows the characteristics of these data sets: the first column gives the name of the data set, the second column the number of classes, the third column the number of patterns, and the last two columns the number of discrete and continuous features, respectively.

TABLE I. CHARACTERISTICS OF THE USED DATA SETS

Data set        Classes  Objects  Discrete Features  Continuous Features
Breast-Cancer      2       286          10                   0
W-Breast (a)       2       699           0                   9
Car                4      1748           6                   4
CMC                3      1473           7                   2
Credit-rating      2       690           9                   6
German-Credit      2      1000          13                   7
Dermatology        6       366           0                  34
Haberman           2       306           0                   3
Lymphography       4       148          15                   3
Segment            7      2310           0                  19
Vote               2       435          16                   0
WDBC               2       569           0                  30
Wine               3       178           0                  13
Yeast             10      1484           0                   8
Zoo                7       101          16                   2
a. Wisconsin
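As a rough illustration of the evaluation protocol reported next (ten repetitions of 10-fold cross-validation per data set), the following sketch uses scikit-learn utilities; the original experiments were run in Matlab, so this is our re-expression of the protocol, not the authors' code.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv_accuracy(X, y, fit_fn, predict_fn, n_repeats=10, n_folds=10, seed=0):
    """Average accuracy of a fit/predict pair over repeated stratified CV,
    e.g. fit_fn=build_ensemble and predict_fn=majority_vote from the sketches above."""
    accs = []
    for r in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for tr, te in skf.split(X, y):
            model = fit_fn(X[tr], y[tr])
            accs.append(np.mean(predict_fn(model, X[te]) == y[te]))
    return float(np.mean(accs))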
For each data set, ten 10-fold cross-validations were performed and the resulting accuracies are reported in Table II; for each data set, the best value is marked with an asterisk. The maximum ensemble size for the proposed method was set to 10, and the same setting was applied to Rotation Forest, Bagging and Adaboost. For the proposed method, the maximum number of iterations and P were set to 50 and 70%, respectively. As can be seen in Table II:
• In 10 data sets, the proposed method's accuracy is higher than that of the other three methods.
• In 14 of the 15 data sets, the proposed method's accuracy is better than Bagging's.
• In 11 data sets, the proposed method outperforms Adaboost.
• In 12 data sets, the proposed method is better than Rotation Forest.
The last row of Table II reports the average accuracy of the methods over all 15 data sets; the proposed method, with 85.06%, has a better average accuracy than the other three methods. Table III gives the rank of each ensemble on each data set, and the last row of this table shows each method's average rank over all data sets; the proposed method, with 1.53, has the best average rank among all five methods. To statistically compare the methods, the Wilcoxon signed rank test [22] was used. The Wilcoxon signed rank test is a nonparametric statistical test that uses ranking to compare two methods. At the 95% confidence level, it shows that the proposed method statistically outperforms the other three methods on these 15 data sets. We used the Kappa-Error diagram [23] to visualize the ensembles and compare the diversity of their base classifiers.
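A sketch of how such a pairwise comparison could be carried out with SciPy is shown below; the two arrays hold the per-data-set accuracies of the proposed method and of Bagging from Table II, and the use of scipy.stats.wilcoxon is our choice of implementation.

import numpy as np
from scipy.stats import wilcoxon

# Per-data-set accuracies (%) of two methods, in the same data-set order as Table II.
proposed = np.array([74.51, 96.25, 96.93, 52.95, 87.83, 75.84, 96.71, 71.45,
                     83.87, 97.18, 96.00, 96.68, 96.29, 59.77, 93.65])
bagging  = np.array([72.06, 96.18, 96.42, 53.03, 86.94, 74.41, 96.08, 70.80,
                     81.56, 96.94, 95.95, 94.69, 95.26, 58.58, 90.81])

stat, p_value = wilcoxon(proposed, bagging)   # paired, two-sided test
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
# A p-value below 0.05 would indicate a significant difference at the 95% level.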
TABLE II. CLASSIFICATION ACCURACY OF THE METHODS (BEST RESULT PER DATA SET MARKED WITH *)

Data set        Single Decision Tree  Bagging  Adaboost  Rotation Forest  Proposed Method
Breast-Cancer          67.66           72.06    72.75        71.15            74.51*
W-Breast               94.95           96.18    96.34        96.79*           96.25
Car                    95.37           96.42    98.02*       96.82            96.93
CMC                    51.38           53.03    53.17*       52.33            52.95
Credit-rating          84.12           86.94    86.63        86.47            87.83*
German-Credit          69.30           74.41    73.23        74.87            75.84*
Dermatology            94.75           96.08    96.48        95.75            96.71*
Haberman               68.05           70.80    70.10        69.19            71.45*
Lymphography           77.85           81.56    83.56        80.67            83.87*
Segment                95.78           96.94    97.98*       97.61            97.18
Vote                   95.72           95.95    95.60        95.87            96.00*
WDBC                   92.46           94.69    95.99        96.50            96.68*
Wine                   90.40           95.26    96.06        96.95*           96.29
Yeast                  53.54           58.58    59.14        59.21            59.77*
Zoo                    89.35           90.81    92.68        90.35            93.65*
Average                81.37           83.98    84.51        84.03            85.06
TABLE III. RANKS OF THE METHODS ON THE DATA SETS

Data set        Single Decision Tree  Bagging  Adaboost  Rotation Forest  Proposed Method
Breast-Cancer           5                3        2             4               1
W-Breast                5                4        2             1               3
Car                     5                4        1             3               2
CMC                     5                2        1             4               3
Credit-rating           5                2        3             4               1
German-Credit           5                3        4             2               1
Dermatology             5                3        2             4               1
Haberman                5                2        3             4               1
Lymphography            5                3        2             4               1
Segment                 5                4        1             2               3
Vote                    4                2        5             3               1
WDBC                    5                4        3             2               1
Wine                    5                4        3             1               2
Yeast                   5                4        3             2               1
Zoo                     5                3        2             4               1
Average                4.93            3.13     2.46          2.93            1.53
The Kappa-Error diagram is a cloud of L(L−1)/2 points, where L is the ensemble size. Each point corresponds to a pair of base classifiers: the y coordinate is the pair's average error and the x coordinate is the kappa value of the pair. Kappa is a pair-wise diversity measure defined as

\kappa = \frac{\sum_{i=1}^{L} C_{ii}/m \;-\; \sum_{i=1}^{L}\left(\sum_{s=1}^{L} C_{is}/m\right)\left(\sum_{k=1}^{L} C_{ki}/m\right)}{1 \;-\; \sum_{i=1}^{L}\left(\sum_{s=1}^{L} C_{is}/m\right)\left(\sum_{k=1}^{L} C_{ki}/m\right)}    (1)

where m is the size of the data set and Cij is the number of samples that the first classifier has labeled as class ωi and the second classifier has labeled as class ωj. κ = 1 indicates that the two classifiers agree on all samples, while κ = 0 indicates that they agree only by chance. Thus a small κ shows that two classifiers are more diverse, while large values show that they decide more identically.
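For concreteness, the following sketch computes κ and the average error for every classifier pair, i.e. the points of a Kappa-Error diagram. The helper names are ours, class labels are assumed to be integer-coded 0..n_classes-1, and the base classifiers are assumed to expose a scikit-learn-style predict interface.

import numpy as np

def pairwise_kappa(pred_a, pred_b, n_classes):
    """Kappa of Eq. (1) from two classifiers' predictions on the same data."""
    m = len(pred_a)
    C = np.zeros((n_classes, n_classes))
    for a, b in zip(pred_a, pred_b):              # coincidence matrix
        C[int(a), int(b)] += 1
    theta1 = np.trace(C) / m                                  # observed agreement
    theta2 = np.sum((C.sum(axis=1) / m) * (C.sum(axis=0) / m))  # chance agreement
    return (theta1 - theta2) / (1 - theta2)

def kappa_error_points(ensemble, X, y, n_classes):
    """One (kappa, average error) point per pair of base classifiers."""
    preds = [clf.predict(X) for clf in ensemble]
    errs = [np.mean(p != y) for p in preds]
    return np.array([(pairwise_kappa(preds[i], preds[j], n_classes),
                      (errs[i] + errs[j]) / 2)
                     for i in range(len(ensemble))
                     for j in range(i + 1, len(ensemble))])

# pts = kappa_error_points(ensemble, X_test, y_test, n_classes)
# A scatter plot of pts[:, 0] against pts[:, 1] reproduces a diagram like Fig. 2.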
Fig. 2 shows the Kappa-Error diagram for the Credit-rating data set. The average value of κ over all base classifier pairs and the average error of all classifiers are shown in the upper right corner of each diagram. The ensemble size was set to 50 for all methods, so every diagram contains 50 × (50 − 1) / 2 = 1225 points. As expected, and as can be seen in Fig. 2, Adaboost, with average κ = 0.23024, has the most diverse base classifiers; Bagging with κ = 0.28791, Rotation Forest with κ = 0.29261 and the proposed method with κ = 0.29501 take the second, third and fourth ranks, respectively. On the other hand, the proposed method has the most accurate base classifiers, with an average error of 0.1542, followed by Rotation Forest with 0.16168, Bagging with 0.16656 and Adaboost with 0.30779. The diversity difference between Rotation Forest and the proposed method is negligible, but the difference between their base classifiers' average errors is not, and this accounts for the difference between the two ensembles' accuracies. Although the Adaboost and Bagging ensembles are more diverse than Rotation Forest and the proposed method, the lower accuracy of their base classifiers has made the resulting ensembles less accurate.

Figure 2. Kappa-Error diagram for the Credit-rating data set. The average value of κ over all base classifier pairs and the average error of all classifiers are shown in the upper right corner of each diagram.

Fig. 3 investigates the influence of ensemble size on the performance of the methods on the Breast-Cancer data set. The ensemble size was varied from 1 to 30 and, as can be seen, the proposed method retains its dominance over the other three methods for all ensemble sizes.

Figure 3. Ensemble methods' classification accuracy for different ensemble sizes

IV. CONCLUSIONS AND FUTURE WORK

In this paper we proposed a new method for creating ensemble classifiers. The proposed method creates diversity among base classifiers by taking a bootstrap sample of the original data set and adding random values, scaled relative to the feature averages, to a set of arbitrary attributes of each training pattern. A selection mechanism is used to choose among the produced base classifiers. The Wilcoxon signed rank test shows that this method outperforms Bagging, Boosting and Rotation Forest on 15 real-world data sets from the UCI repository.
Future work can include:
• Changing the selection method for base classifiers; for example, soft computing methods can be used.
• Examining the effect of changing the base learner from a decision tree to other, more complex learners such as neural networks.
REFERENCES

[1] N. Hatami, "Thinned-ECOC ensemble based on sequential code shrinking," Expert Systems with Applications, vol. 39, no. 1, pp. 936-947, January 2012.
[2] N.-C. Hsieh and L.-P. Hung, "A data driven ensemble classifier for credit scoring analysis," Expert Systems with Applications, vol. 37, no. 1, pp. 534-545, January 2010.
[3] P. Melville and R. J. Mooney, "Constructing diverse classifier ensembles using artificial training examples," in International Joint Conference on Artificial Intelligence, 2003, pp. 505-512.
[4] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[5] L. Rokach, Pattern Classification Using Ensemble Methods. World Scientific, 2010.
[6] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory, P. Vitányi, Ed. Springer Berlin/Heidelberg, 1995, pp. 23-37.
[7] J. J. Rodriguez, L. I. Kuncheva, and C. J. Alonso, "Rotation Forest: A new classifier ensemble method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, pp. 1619-1630, 2006.
[8] J. R. Quinlan, "Bagging, boosting, and C4.5," in Proceedings of the Thirteenth National Conference on Artificial Intelligence, 1996, pp. 725-730.
[9] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[10] E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants," Machine Learning, vol. 35, pp. 1-38, 1999.
[11] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in International Conference on Machine Learning, 1996, pp. 332-352.
[12] L. Breiman, "Arcing classifiers," Annals of Statistics, vol. 26, no. 3, pp. 801-849, 1998.
[13] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting," Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000.
[14] J. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 368-378, February 2002.
[15] T. Pham and A. Smeulders, "Quadratic boosting," Pattern Recognition, vol. 41, pp. 331-341, 2008.
[16] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods," The Annals of Statistics, vol. 26, no. 5, pp. 1651-1686, October 1998.
[17] C. Rudin, I. Daubechies, and R. E. Schapire, "The dynamics of AdaBoost: Cyclic behavior and convergence of margins," Journal of Machine Learning Research, vol. 5, pp. 1557-1595, 2004.
[18] R. E. Schapire, "Theoretical views of boosting and applications," in Proceedings of the Tenth International Conference on Algorithmic Learning Theory, 1999, pp. 13-25.
[19] T. G. Dietterich, "Ensemble methods in machine learning," in Lecture Notes in Computer Science, vol. 1857, Cagliari, Italy, 2000, pp. 1-15.
[20] C.-X. Zhang and J.-S. Zhang, "RotBoost: A technique for combining Rotation Forest and AdaBoost," Pattern Recognition Letters, vol. 29, no. 10, pp. 1524-1536, July 2008.
[21] C. L. Blake and C. J. Merz, UCI Repository of Machine Learning Databases. [Online]. Available: http://archive.ics.uci.edu/ml/
[22] J. D. Gibbons, Nonparametric Statistical Inference, 4th ed. Marcel Dekker, 2003.
[23] D. Margineantu and T. Dietterich, "Pruning adaptive boosting," in Proceedings of the Fourteenth International Conference on Machine Learning, 1997, pp. 211-218.
[24] E. Alpaydin, Introduction to Machine Learning, 2nd ed. MIT Press, 2010.
[25] L. Breiman, "Random Forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
[26] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, 1998.