A Theoretical Framework on the Ideal Number of Classifiers for Online Ensembles in Data Streams

Hamed R. Bonab, Fazli Can
[email protected], [email protected]
Bilkent Information Retrieval Group, Computer Engineering Department, Bilkent University, 06800, Bilkent, Ankara, Turkey
Abstract

A priori determination of the ideal number of component classifiers of an ensemble is an important problem. The volume and velocity of big data streams make this even more crucial in terms of prediction accuracy and resource requirements. A limited number of studies address this problem for batch mode, and none for online environments. Our theoretical framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy. We prove the existence of an ideal number of classifiers for an ensemble, using the weighted majority voting aggregation rule.

Theorem

Statement. If the number of component classifiers is not equal to the number of class labels, m ≠ p, then the coefficient matrix is rank-deficient, det A = 0.

Conclusion. The ideal number of component classifiers is the number of class labels of a dataset, with the premise that they generate independent scores.
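The rank-deficiency claim can be illustrated with a small sketch (our own, not taken from the poster): stacking the p-dimensional score vectors of m classifiers and forming the Gram-style coefficient matrix of the least-squares normal equations, m vectors living in a p-dimensional space have rank at most p, so for m > p the matrix is singular.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 3   # number of class labels
m = 5   # number of component classifiers, m > p

# Each classifier emits a p-dimensional score vector; stack them as rows.
S = rng.random((m, p))

# Coefficient matrix of the normal equations (an m x m Gram matrix).
A = S @ S.T

# Rank of A is at most p, so for m > p it is rank-deficient: det A = 0.
print(np.linalg.matrix_rank(A))               # at most p
print(np.isclose(np.linalg.det(A), 0.0))      # True (up to floating-point noise)
```

The same argument fails for m = p, where generic score vectors make A full-rank and the optimum weights uniquely determined.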
Keywords: Big data stream; ensemble size; weighted majority voting
Contributions of the Study

- We study the ideal number of component classifiers of online ensembles for the first time in the literature.
- We theoretically model online ensembles using a geometric framework and prove that, for the highest prediction accuracy, the number of classifiers should be the same as the number of class labels; we refer to this as "the law of diminishing returns in ensemble construction."
- We experimentally examine our hypothesis and suggest an upper bound for the number of classifiers that gives the highest accuracy.
Experimental Results

Two well-known online ensembles, Accuracy Updated Ensemble (AUE) and Leverage Bagging (LevBag), are used as online ensemble classifiers; 6 real-world and 6 synthetic datasets are used as data streams.

Synthetic Data Streams. We chose the popular Random RBF generator, since it can generate data streams with an arbitrary number of features and class labels. The number of class labels is chosen as 2, 4, 8, 16, 32, and 64, reflected in the naming of our RBF data streams (RBF-2, …).

Real-world Data Streams. We selected 6 different large real-world datasets used as data streams in the literature.

For implementation, the MOA framework is used in our experiments, and accuracy values are measured using the Test-Then-Train approach.
Background

There are limited studies for batch ensembles and none for online ensembles. Latinne et al. [2] propose a simple empirical procedure for limiting the number of classifiers, based on the McNemar nonparametric test of significance. Oshiro et al. [3] argue that there is an ideal number of component classifiers, beyond which adding base classifiers brings no significant performance gain and only increases computational cost, using the weighted average area under the ROC curve (AUC) and several dataset density metrics. Hernandez-Lobato et al. [4] suggest a statistical algorithm for determining the size of an ensemble by estimating the number of classifiers required to obtain stable aggregated predictions under majority voting.
Figure 3: Prediction behavior of AUE and LevBag, in terms of accuracy, with doubling the number of components for synthetic and real-world data streams.
Discussion
Figure 1: Symbol Notation.
Our Geometric Framework

We propose a framework for studying the theoretical side of online ensemble classifiers over data streams, based on [1]. For aggregation, we use the weighted majority voting rule.
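In geometric terms, each component classifier contributes a p-dimensional score vector and the ensemble prediction is the class maximizing the weighted sum of these vectors. A minimal sketch with illustrative numbers (the score matrix and weights below are made up for demonstration):

```python
import numpy as np

# Scores of m = 3 component classifiers over p = 3 class labels
# (rows: classifiers, columns: classes); values are illustrative.
S = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

w = np.array([0.5, 0.3, 0.2])   # component weights

combined = w @ S                # weighted sum of score vectors, one point in R^p
predicted_class = int(np.argmax(combined))
print(predicted_class)          # class 0 wins: combined ≈ [0.38, 0.34, 0.28]
```

The combined point lives in the same p-dimensional space as the individual score vectors, which is what lets the weight assignment be framed as moving that point as close as possible to the ideal (one-hot) point.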
Figure 2: Schema of our geometric framework.
Optimum Weight Assignment

We assume that all base classifiers are independent of each other. For optimization, the linear least squares (LSQ) solution is used, with the squared Euclidean norm as our measure of closeness. LSQ yields a matrix equation for the optimum weights.
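As a hedged sketch of this idea (an illustration of least-squares weight fitting in general, not the poster's exact derivation): each instance contributes p linear equations in the m weights, requiring the weighted combination of classifier score vectors to be close to the one-hot ideal point of the true label, and the stacked system is solved in the least-squares sense.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 200, 3, 3   # instances, classifiers, class labels

# Score vector of each classifier for each instance: shape (n, m, p).
scores = rng.random((n, m, p))

# Ideal points: one-hot vectors of the true labels, shape (n, p).
labels = rng.integers(0, p, size=n)
ideal = np.eye(p)[labels]

# Each (instance, class) pair gives one equation: sum_j w_j * s_ijk ≈ o_ik.
X = scores.transpose(0, 2, 1).reshape(n * p, m)   # coefficients of w
y = ideal.reshape(n * p)

w, *_ = np.linalg.lstsq(X, y, rcond=None)         # optimum weights
print(w)
```

Substituting the normal equations X^T X w = X^T y of this system is where the m x m coefficient matrix of the theorem arises; its determinant vanishes when m ≠ p.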
Comparing the accuracy peaks of AUE and LevBag with the Wilcoxon signed-ranks statistical test results in 10 positive and 2 negative differences; the accuracy peaks of AUE can be accepted as statistically significantly higher than those of LevBag. The non-parametric Friedman statistical test shows that the theoretical numbers of classifiers are statistically significantly different from the numbers giving the practical peak accuracy for both algorithms. Multiplying these theoretical numbers of classifiers by a constant, in our case 2, makes the differences statistically insignificant. This can be used to obtain upper bounds on the ideal number of classifiers for a given data stream and ensemble classifier.
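Such a paired comparison can be reproduced with `scipy.stats.wilcoxon`. The accuracy values below are hypothetical stand-ins (chosen only to mirror the 10-positive/2-negative split), not the paper's measurements:

```python
from scipy.stats import wilcoxon

# Hypothetical paired peak accuracies on 12 data streams (illustrative only).
aue    = [91.2, 88.5, 84.1, 79.8, 93.4, 90.0, 86.7, 82.3, 95.1, 89.9, 87.2, 80.5]
levbag = [90.1, 87.9, 84.6, 78.2, 92.8, 89.1, 85.9, 81.0, 94.3, 90.4, 86.0, 79.7]

# Signed-rank test on the paired differences; 10 are positive, 2 negative.
stat, pvalue = wilcoxon(aue, levbag)
print(stat, pvalue)   # reject equal medians when pvalue < 0.05
```

With a small number of paired streams, the signed-rank test is preferred over a paired t-test because it makes no normality assumption about the accuracy differences.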
Conclusion

- Theoretically, using the same number of independent component classifiers as class labels gives the highest prediction accuracy.
- Practically, since the independence of component classifiers is violated, determining these peak values is nearly impossible; however, upper bounds can be considered for this problem, and that needs further investigation.
- An important implication of our study is that online ensemble classifiers should be compared based on these peak values, since comparisons based on a fixed number of classifiers can be misleading.
Future Work. Determining more specific upper and lower bounds on the number of components, extending our theorem to batch-mode ensemble classifiers, and discussing binary classification problems are directions for our future work.
References

[1] L.-W. Chan. Weighted least square ensemble networks. In IJCNN '99, 2:1393-1396. IEEE, 1999.
[2] P. Latinne, O. Debeir, and C. Decaestecker. Limiting the number of trees in random forests. In MCS '01, Cambridge, UK, July 2-4, 2001, pages 178-187, 2001.
[3] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas. How many trees in a random forest? In MLDM '12, Berlin, Germany, July 13-20, 2012, pages 154-168, 2012.
[4] D. Hernandez-Lobato, G. Martinez-Munoz, and A. Suarez. How large should ensembles of classifiers be? Pattern Recognition, 46(5):1323-1336, 2013.
25th ACM International Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, USA, 24-28 October 2016.