ENSEMBLE LEARNING

XU YING
Abstract. In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.

Key words and phrases. Ensemble Learning.
1. Introduction

Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete, finite set of alternative models, but typically allows for much more flexible structure to exist between those alternatives [Rokach and Maimon (2001), Zhou and Jiang (2004), Zhou and Tang (2003), Zhou et al. (2002), Zimmermann (2005), Zupan et al. (1998)].

Supervised learning algorithms are commonly described as searching through a hypothesis space to find a hypothesis that will make good predictions on a particular problem. Even if the hypothesis space contains hypotheses that are very well suited to the problem, it may still be difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis; in other words, an ensemble is a technique for combining many "weak learners" in an attempt to produce a "strong learner". The term "ensemble" is usually reserved for methods that generate multiple hypotheses using the same base learner, while the broader term "multiple classifier systems" also covers the hybridization of hypotheses that are not induced by the same base learner [Arbel and Rokach (2006), Menahem et al. (2009), Rokach et al. (2008), Rokach (2008), Rokach and Maimon (2005)].

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. Fast algorithms such as decision trees are commonly used with ensembles (for example, Random Forest), although slower algorithms can benefit from ensemble techniques as well.

2. Ensemble theory

An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble therefore represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built; thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data [Aha et al. (1991), Al-Sultan (1995), Al-Sultan and Khan (1996), Ali and Pazzani (1996), Almuallim (1996)]. Empirically, ensembles tend to yield better results when there is significant diversity among the models.
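To make this concrete, the following minimal sketch (an illustration only, assuming the scikit-learn library is available; the synthetic data set and base models are arbitrary placeholders) builds a simple voting ensemble and then trains and queries it exactly as one would a single model.

# Illustrative sketch only: a voting ensemble is itself a supervised learner,
# trained with fit() and queried with predict()/score() like any single model.
# Assumes scikit-learn is installed; data and base models are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Several relatively weak base hypotheses.
base_models = [
    ("stump", DecisionTreeClassifier(max_depth=1, random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
]

# The ensemble combines their class votes with equal weight.
ensemble = VotingClassifier(estimators=base_models, voting="hard")
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))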
3. Common types of ensembles

3.1. Bayes optimal classifier. The Bayes optimal classifier is a classification technique: it is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it, so it is the ideal ensemble. It predicts

\[ y = \operatorname*{arg\,max}_{c_j \in C} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i), \]

where $C$ is the set of possible classes, $H$ is the hypothesis space, $T$ is the training data, and $P$ denotes a probability. Unfortunately, the Bayes optimal classifier cannot be practically implemented for any but the simplest of problems, for several reasons: (1) most interesting hypothesis spaces are too large to iterate over, as required by the $\arg\max$; (2) many hypotheses yield only a predicted class rather than a probability for each class, as required by the term $P(c_j \mid h_i)$; (3) computing an unbiased estimate of the probability of the training set given a hypothesis, $P(T \mid h_i)$, is non-trivial; and (4) estimating the prior probability $P(h_i)$ for each hypothesis is rarely feasible.

3.2. Bootstrap aggregating (bagging). Bootstrap aggregating, often abbreviated as "bagging", involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble on a randomly drawn subset of the training set. As an example, the Random Forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.

3.3. Boosting. Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models mis-classified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to over-fit the training data. By far the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results. (Minimal code sketches of both bagging and boosting are given at the end of the article.)

4. Stacking

Stacking (sometimes called "stacked generalization") involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained on the available data; then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs. If an arbitrary combiner algorithm is used, stacking can theoretically represent any of the ensemble techniques described in this article, although in practice a single-layer logistic regression model is often used as the combiner. Stacking typically yields better performance than any single one of the trained models, and it has been used successfully on supervised learning tasks such as regression. The two top performers in the Netflix competition utilized "blending", which may be considered a form of stacking.
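As a concrete illustration of the stacking procedure (a minimal sketch under the assumption that the scikit-learn library is available; the base models, data set, and parameters are placeholders rather than recommendations), a stacked ensemble with a logistic-regression combiner can be written as:

# Minimal stacking sketch. Assumes scikit-learn; base models and data are
# placeholders chosen only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0 learners are fit on the data; their cross-validated predictions
# become the inputs of the level-1 combiner, here a logistic regression.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("stacked accuracy:", stack.score(X_test, y_test))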
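For completeness, the bagging and boosting procedures of Sections 3.2 and 3.3 can be sketched in the same way (again a minimal illustration assuming scikit-learn; BaggingClassifier defaults to decision-tree base models and AdaBoostClassifier to depth-one decision stumps, and all parameter values are placeholders):

# Illustrative sketches of bagging (Section 3.2) and boosting (Section 3.3).
# Assumes scikit-learn; parameter values are placeholders, not recommendations.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each base tree is fit on a bootstrap sample of the training set,
# and the trees' predictions are combined with equal weight.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting (AdaBoost): each new base model is trained with higher weight on
# the training instances that the previous models mis-classified.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))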
References

Aha et al. (1991). Aha, D. W., Kibler, D., and Albert, M. K., Instance-based learning algorithms, Machine Learning 6(1):37-66, 1991.
Al-Sultan (1995). Al-Sultan, K. S., A tabu search approach to the clustering problem, Pattern Recognition 28:1443-1451, 1995.
Al-Sultan and Khan (1996). Al-Sultan, K. S., and Khan, M. M., Computational experience on four algorithms for the hard clustering problem, Pattern Recognition Letters 17(3):295-308, 1996.
Ali and Pazzani (1996). Ali, K. M., and Pazzani, M. J., Error reduction through learning multiple descriptions, Machine Learning 24(3):173-202, 1996.
Almuallim (1996). Almuallim, H., An efficient algorithm for optimal pruning of decision trees, Artificial Intelligence 83(2):347-362, 1996.
Arbel and Rokach (2006). Arbel, R., and Rokach, L., Classifier evaluation under limited resources, Pattern Recognition Letters 27(14):1619-1631, 2006.
Menahem et al. (2009). Menahem, E., Rokach, L., and Elovici, Y., Troika: an improved stacking schema for classification tasks, Information Sciences 179(24):4097-4122, 2009.
Rokach et al. (2008). Rokach, L., Romano, R., and Maimon, O., Negation recognition in medical narrative reports, Information Retrieval 11(6):499-538, 2008.
Rokach (2008). Rokach, L., Mining manufacturing data using genetic algorithm-based feature set decomposition, International Journal of Intelligent Systems Technologies and Applications, 2008.
Rokach and Maimon (2005). Rokach, L., and Maimon, O., Decision trees, in Data Mining and Knowledge Discovery Handbook, pp. 165-192, 2005.
Rokach and Maimon (2001). Rokach, L., and Maimon, O., Theory and applications of attribute decomposition, in Proceedings of the IEEE International Conference on Data Mining (ICDM 2001), 2001.
Zhou and Jiang (2004). Zhou, Z. H., and Jiang, Y., NeC4.5: Neural Ensemble Based C4.5, IEEE Transactions on Knowledge and Data Engineering 16(6):770-773, 2004.
Zhou and Tang (2003). Zhou, Z. H., and Tang, W., Selective ensemble of decision trees, in Wang, G., Liu, Q., Yao, Y., and Skowron, A. (Eds.), Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, 9th International Conference (RSFDGrC 2003), Chongqing, China, Lecture Notes in Computer Science 2639, pp. 476-483, 2003.
Zhou et al. (2002). Zhou, Z. H., Wu, J., and Tang, W., Ensembling neural networks: many could be better than all, Artificial Intelligence 137:239-263, 2002.
Zimmermann (2005). Zimmermann, H. J., Fuzzy Set Theory and its Applications, 4th edition, Springer, 2005.
Zupan et al. (1998). Zupan, B., Bohanec, M., Demsar, J., and Bratko, I., Feature transformation by function decomposition, IEEE Intelligent Systems & Their Applications 13:38-43, 1998.
Zupan et al. (2000). Zupan, B., Bratko, I., Bohanec, M., and Demsar, J., Induction of concept hierarchies from noisy data, in Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), San Francisco, CA, pp. 1199-1206, 2000.

University of Georgia, USA