Improving Ensembles with Classificational Cellular Automata

Petra Povalej, Mitja Lenič, and Peter Kokol

Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova ulica 17, 2000 Maribor, Slovenia
{Petra.Povalej,Mitja.Lenic,Kokol}@uni-mb.si
http://lsd.uni-mb.si

Abstract. In the real world there are many examples where the synergetic cooperation of multiple entities performs better than a single one. The same fundamental idea is found in ensemble learning methods, which are able to improve classification accuracy. Each classifier has a specific view on the problem domain and can produce a different classification for the same observed sample. Many methods for combining classifiers into ensembles have therefore already been developed; most of them use simple majority voting or weighted voting to combine the single classifier votes. In this paper we present a new approach for combining classifiers into an ensemble with Classificational Cellular Automata (CCA), which exploits the self-organizational abilities of cellular automata. We empirically show that CCA improves the classification accuracy of three popular ensemble methods: Bagging, Boosting and MultiBoosting. The presented results also show important advantages of CCA, such as problem independence, robustness to noise and no need for user input.
1 Introduction

In recent years there has been a growing interest in combining classifiers into ensembles, also known as committees or multiple classifier systems. The intuitive concept behind ensemble learning methods is that no single classifier can claim to be uniformly superior to any other, and that integrating several single approaches will enhance the performance of the final classification [1]. Hence, using the classification capabilities of multiple classifiers, where each classifier may make different and perhaps complementary errors, tends to yield improved performance over single classifiers. Some ensemble learning approaches actively perturb some aspect of the training set, such as the training samples, attributes or classes, in order to ensure classifier diversity. Two of the most popular perturbation approaches are Bootstrap Aggregation (Bagging) and Boosting. Bagging, first introduced by Breiman [2] in 1996, manipulates the training samples to form replicate training sets; the final classification is based on a majority vote. Boosting, introduced by Freund and Schapire in 1996 [3], combines classifiers with weighted voting and is more complex, since the distribution of samples in the training set is adaptively changed according to the performance of the sequentially constructed classifiers. In general, ensemble learning approaches can be divided into three groups: (1) ensemble learning approaches that combine different independent classifiers (such as
Bayesian Voting, Majority Voting, etc.), (2) ensemble learning approaches which construct a set of classifiers from one base classifier by perturbing the training set (such as Bagging, Boosting, Windowing, etc.) and (3) combinations of (1) and (2). A detailed empirical study is presented in [4]. When studying groups (1) and (2) we came across some drawbacks. The most essential deficiency of (1) is the restriction to a predefined way of combining classifiers induced with different predefined machine-learning methods. On the other hand, the ensemble learning approaches that are based on improving one base classifier (2) use only one method for constructing all classifiers in the ensemble; the problem of selecting the appropriate method for solving a specific task therefore arises. As an example of an ensemble learning approach which combines (1) and (2) we presented Classificational Cellular Automata (CCA) in [5]. The basic idea of CCA is to combine classifiers induced with various machine-learning methods into an ensemble in a non-predefined way. After several iterations of applying adequate transaction rules, only the classifiers which contribute most to the final classification are preserved. Consequently, the problem of choosing an appropriate machine-learning method, or a combination of methods, is solved automatically. The idea of using cellular automata as a model for ensemble learning is presented in Section 3. In this paper we focus on CCA's ability to improve the classification accuracy of ensemble learning methods. The experiments using the Bagging, Boosting and MultiBoost ensemble learning methods on 9 randomly chosen databases from the UCI repository are presented in Section 4. The paper concludes with some final remarks.
2 The Basics of Cellular Automata

The concept of cellular automata (CA) was first proposed in the 1940s by J. von Neumann [6] and Stanislaw Ulam. Since those early years CA have attracted many researchers from all scientific domains, physical and social alike. The reason for the popularity of CA is their simplicity and their enormous potential for modeling the behavior of complex systems. A CA can be viewed as a simple model of a spatially extended decentralized system made up of a number of individual components (cells). Each cell is in a specific state which changes over time depending on the states of the neighborhood cells and according to the transaction rules. In spite of their simplicity, when iterated several times the dynamics of CA are potentially very rich, ranging from attracting stable configurations to spatio-temporal chaos and pseudo-random generation abilities. These abilities provide the diversity that can help overcome local optima when solving engineering problems. Moreover, from the computational viewpoint CA are universal, that is, as powerful as Turing machines and thus as classical von Neumann architectures. These structural and dynamical features make them very powerful: fast CA-based algorithms have been developed to solve engineering problems in cryptography and microelectronics, for instance, and theoretical CA-based models have been built in ecology, biology, physics and image processing. On the other hand, these powerful features make CA difficult to analyze; almost all long-term behavioral properties of dynamical systems, and of cellular automata in particular, are unpredictable. However, in this paper the aim is not to analyze the dynamics of CA but to use them for classification tasks.
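For illustration, the basic mechanism can be sketched in a few lines of Python. The sketch below assumes a binary-state 2D lattice with a Moore neighborhood of radius 1 wrapped into a torus, and a placeholder rule (here a simple neighborhood majority vote); none of these choices are taken from the paper, they only show how cell states evolve under a transaction rule.

```python
import numpy as np

def step(grid: np.ndarray, rule) -> np.ndarray:
    """One synchronous CA update on a toroidal 2D grid.

    `rule` maps (current_state, neighbor_states) to the new state of a cell.
    """
    rows, cols = grid.shape
    new_grid = np.empty_like(grid)
    for r in range(rows):
        for c in range(cols):
            # Moore neighborhood of radius 1, wrapped around the borders (torus).
            neighbors = [grid[(r + dr) % rows, (c + dc) % cols]
                         for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                         if not (dr == 0 and dc == 0)]
            new_grid[r, c] = rule(grid[r, c], neighbors)
    return new_grid

# Illustrative rule: a cell takes the majority state of its neighborhood.
majority = lambda state, nbrs: int(sum(nbrs) > len(nbrs) // 2)

grid = np.random.randint(0, 2, size=(10, 10))
for _ in range(5):
    grid = step(grid, majority)
```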
3 Classificational Cellular Automata

In this section we introduce the basic structure, the learning algorithm and the initial parameters of CCA. CCA is presented as a classifier. In general, learning a classifier is based on samples from a training set, where every training instance is completely described by a set of attributes (instance properties) and a class (decision). CCA is initially defined as a 2D lattice of cells. Each cell can contain a classifier, and according to its classification of an input instance the cell can be in one of k states, where k is the number of possible classes plus one for the state "cannot classify". The last state plays an especially important role when a classifier is used that is not defined on the whole learning set, e.g. a single if-then rule. From the classification point of view, the outcome of each cell in the learning process can therefore be: (1) the same as the training instance's class, (2) different from the training instance's class, or (3) "cannot classify". A cell with an unknown classification for the current training instance should, however, be treated differently from a misclassification.

Beside the cell's classification ability, the neighborhood plays a very important role in the self-organization ability of a CCA. Transaction rules use the specific neighborhood state to calculate a cell's new state and have to be defined in such a way that they enforce the self-organization of the CCA, which should consequently lead to a generalization process. In general we want to group classifiers that support a similar hypothesis and therefore classify the training instances similarly. Thus, even if an instance is wrongly classified, the neighborhood can support a cell by preventing the elimination of its classifier from the automaton. With this transaction rule we encourage the creation of decision centers for a specific class and can in this way overcome the problem of noisy training instances.

As for all ensemble learning approaches, it is clear that if all classifiers are identical or even similar, there can be no advantage in combining their decisions; some difference among the base classifiers is therefore a necessary condition for improvement. The diversity of the classifiers in the CCA cells is ensured by using different machine-learning methods for classifier induction. However, there is only a limited number of machine-learning methods, which can be a problem for a large CCA. Most of the methods have tuning parameters that affect classification, so by changing those parameters many different classifiers can be obtained. Another possibility for obtaining several different classifiers is to change the expected probability distribution of the input instances, which may also result in different classifiers even when the same machine-learning method with the same parameters is used. Still another approach is feature reduction/selection; that technique is recommended when many features are present.

3.1 Learning Algorithm

Once the diversity of the induced classifiers is ensured by the means discussed above, the classifiers are placed into a pool. For each classifier, basic statistical information such as confidence and support is preserved. When filling the CCA, a classifier is randomly chosen from the pool for each cell (a sketch of this initialization is given after the algorithm below). After filling the CCA, the following learning algorithm is applied:
Input: learning set with N learning samples
Number of iterations: t = 1, 2, ..., T
For t = 1, 2, ..., T:
− choose a learning sample I
− each cell in the automaton classifies the learning sample I
− change the cells' energy according to the transaction rules
− a cell with energy below zero does not survive
− probabilistically fill the empty cells with classifiers from the pool
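A minimal sketch of this initialization step is given below. It assumes that every classifier in the pool is stored together with its support and confidence estimated on the training set; the Cell container, the field names and the initial energy value are illustrative assumptions rather than details taken from the paper.

```python
import random
from dataclasses import dataclass

@dataclass
class Cell:
    classifier: object     # any trained model exposing predict(sample) -> class or None
    support: float         # basic statistic kept for each classifier (see text)
    confidence: float
    energy: float = 100.0  # assumed initial energy; the paper does not state the value

def fill_automaton(pool, rows=10, cols=10, rng=random):
    """Randomly draw a classifier (with its statistics) from the pool for every cell."""
    return [[Cell(*rng.choice(pool)) for _ in range(cols)] for _ in range(rows)]

# `pool` is a list of (classifier, support, confidence) triples obtained from
# diverse machine-learning methods, parameter settings, resamplings or feature subsets.
```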
Beside its classifier, each cell also contains statistical information about its success in the form of the cell's energy. Transaction rules can increase or decrease the energy level depending on the success of classification and on the state of the cell's neighbors. Each neighborhood cell influences the energy of the current cell according to its score (Eq. 1):

score = support · confidence · distance ;    (1)

where distance is the Euclidean distance from the neighborhood cell to the current cell. The sum of the scores of all neighborhood cells that classified the learning sample in the same way as the current cell (eqScore) is used in the transaction rules to calculate the cell's new energy e. Similarly, the sum of the scores of all neighborhood cells that cannot classify the learning sample (noClassScore) is calculated. Both are used in the following transaction rules:

• If a cell's classification is the same as the sample class:
  (a) if noClassScore > 0, then increase the energy of the cell using Eq. 2:
      e = e + (eqScore · 100) / (100 − noClassScore) ;    (2)
  (b) if all neighborhood cells classified the learning sample (noClassScore = 0), then increase the cell's energy according to Eq. 3:
      e = e + 400 .    (3)
• If a cell's classification differs from the learning sample class:
  (a) if noClassScore > 0, then decrease the energy of the cell using Eq. 4:
      e = e − (eqScore · 100) / (100 − noClassScore) ;    (4)
  (b) if noClassScore = 0, then decrease the cell's energy using Eq. 5:
      e = e − 400 .    (5)
• If a cell cannot classify the learning sample, then slightly decrease the energy state of the cell (Eq. 6):
      e = e − 10 .    (6)

In every iteration each cell uses one point of energy (to live). If its energy drops below zero, the cell is terminated and becomes a blank cell. Depending on the learning algorithm parameters, a new cell can be created with an initial energy state and a classifier taken from the pool of classifiers or newly generated. Of course, if a cell is too different from its neighborhood it will ultimately die out and its classifier will be returned to the pool.
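As a hedged sketch, the energy update for a single cell and a single learning sample could be implemented as follows. It reuses the Cell container assumed above, takes the neighborhood as a list of (cell, distance) pairs supplied by the caller, uses the convention that predict returns None for "cannot classify", and follows Eqs. 1-6 directly.

```python
def score(cell, distance):
    # Eq. 1: score = support * confidence * distance
    return cell.support * cell.confidence * distance

def update_energy(cell, neighborhood, sample, true_class):
    """Apply the transaction rules (Eqs. 2-6) to one cell for one learning sample.

    `neighborhood` is a list of (neighbor_cell, euclidean_distance) pairs.
    Returns True if the cell survives, False if it is terminated.
    """
    prediction = cell.classifier.predict(sample)

    # Sum of scores of neighbors with the same classification as this cell (eqScore)
    # and of neighbors that cannot classify the sample (noClassScore).
    eq_score = sum(score(n, d) for n, d in neighborhood
                   if n.classifier.predict(sample) == prediction)
    no_class_score = sum(score(n, d) for n, d in neighborhood
                         if n.classifier.predict(sample) is None)

    if prediction is None:                       # cell cannot classify: Eq. 6
        cell.energy -= 10
    elif prediction == true_class:               # correct classification
        if no_class_score > 0:                   # Eq. 2
            cell.energy += eq_score * 100 / (100 - no_class_score)
        else:                                    # Eq. 3
            cell.energy += 400
    else:                                        # misclassification
        if no_class_score > 0:                   # Eq. 4
            cell.energy -= eq_score * 100 / (100 - no_class_score)
        else:                                    # Eq. 5
            cell.energy -= 400

    cell.energy -= 1                             # every cell pays one point of energy to live
    return cell.energy >= 0                      # below zero -> the cell is terminated
```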
The learning of a CCA is done incrementally by supplying samples from the learning set. The transaction rules are first executed on the whole CCA with a single sample and then with the next one, until the whole problem is learned using all samples; this is a technique similar to that used in neural networks [Towell et al., 1993]. The transaction rules do not directly imply learning, but iterating those rules creates the self-organizing ability. This ability, however, depends on the classifiers used in the CCA's cells and on its geometry. The stopping criterion can be a fixed number of iterations or can be determined by monitoring the accuracy.

3.2 Inference Algorithm

The inference algorithm differs from the learning algorithm in that it does not use self-organization.

Input: a sample for classification
Number of iterations: t = 1, 2, ..., V
For t = 1, 2, ..., V:
− each cell in the automaton classifies the sample
− change the cells' energy according to the transaction rules
− a cell with energy below zero does not survive
Classify the sample according to the weighted voting of the surviving cells
Output: class of the input sample
The simplest way to produce a single classification would be a majority vote of the cells in the CCA. However, some votes can be very weak from the transaction-rule point of view. Therefore, transaction rules which take the neighborhood majority vote as the sample class are used in order to eliminate all weak cells. After several iterations of the transaction rules, only cells with strong structural support survive. The final class of an input sample is determined by weighted voting, where the energy state of each surviving cell is used as its weight.
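A corresponding sketch of the final voting step, again reusing the Cell container assumed above and treating cells with negative energy as terminated, might be:

```python
from collections import defaultdict

def classify(automaton, sample):
    """Weighted vote over the surviving cells; each vote is weighted by the cell's energy."""
    votes = defaultdict(float)
    for row in automaton:
        for cell in row:
            if cell is None or cell.energy < 0:      # blank or terminated cell
                continue
            prediction = cell.classifier.predict(sample)
            if prediction is not None:               # "cannot classify" cells do not vote
                votes[prediction] += cell.energy
    return max(votes, key=votes.get) if votes else None
```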
4 Experiments and Results

In order to compare CCA with other ensemble methods we used Weka 3, an open-source data mining toolkit written in Java [7]. The following three well-known ensemble methods implemented in Weka were applied to a collection of 9 randomly chosen datasets from the UCI Machine Learning Repository [8]:
• Bagging [2],
• AdaBoostM1, a method of boosting using the AdaBoost algorithm [3],
• MultiBoostAB, a method that combines AdaBoost with wagging [9].
As the base classifier, a C4.5 decision tree [10] was used in all of the ensemble methods listed above. The direct comparison between CCA and these ensemble methods was made by using the classifiers induced by each ensemble method as the source for filling the CCA: all classifiers were put into the pool of classifiers and then used in the initialization and learning of the CCA. CCA was initialized with the following parameters: size 10x10 (bounded into a torus), neighborhood radius r = 5, and stopping criterion t = 1000 iterations.
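The experiments themselves were run with Weka's Java implementations; purely as an analogue (not the setup actually used in the paper), a comparable pool of 50 bagged and 50 boosted decision trees could be produced with scikit-learn and then handed to the CCA as its classifier source:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; the paper uses 9 UCI datasets instead.
X, y = load_iris(return_X_y=True)

# 50 members per ensemble, mirroring the ensemble size used in the experiments.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50).fit(X, y)

# The fitted members become the classifier pool for the CCA; their support and
# confidence would still have to be estimated on the training set.
pool = list(bagging.estimators_) + list(boosting.estimators_)
```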
Through all iterations each cell uses one point of energy (to live); if the energy level of a cell drops below zero, the cell is terminated. In the experiments presented in this section we used the same evaluation criteria as in [5].

4.1 Empirical Evaluation

In the first experiment Bagging was used to create an ensemble of 50 classifiers. All classifiers from the ensemble were introduced into the pool and then used as the source of diverse classifiers for CCA (Bagged CCA). The results obtained on the 9 randomly chosen UCI databases are presented in Table 1. If the accuracy and average class accuracy on the learning set are compared, we can see that there is no essential deviation. However, when the accuracy on unseen test cases is considered, Bagged CCA performed better than or at least equal to Bagging in all cases. Since both methods use the same base classifiers, we can conclude that the improvement in classification accuracy results only from the organization and transaction rules of CCA.

Table 1. Comparison between Bagging and CCA with bagged classifiers (Acc = accuracy, CAcc = average class accuracy)

             |          Bagging          |        Bagged CCA
             |  Learn set  |  Test set   |  Learn set  |  Test set
Data         |  Acc   CAcc |  Acc   CAcc |  Acc   CAcc |  Acc   CAcc
australian   | 94.35  94.30| 86.52  86.78| 95.00  94.91| 87.39  87.32
breast       | 98.71  98.86| 96.99  97.45| 98.71  98.86| 96.99  97.45
cleve        | 95.05  95.06| 77.23  76.52| 96.54  96.52| 84.16  83.12
diabetes     | 95.31  94.36| 73.83  69.94| 93.75  92.27| 75.00  72.80
glass        | 94.37  95.47| 69.44  66.13| 95.07  95.81| 70.83  72.08
golf         |100.00 100.00|100.00 100.00|100.00 100.00|100.00 100.00
heart        | 96.11  95.83| 84.44  84.41| 95.00  94.71| 88.89  88.61
iris         |100.00 100.00| 92.00  92.00|100.00 100.00| 92.00  92.00
pima         | 96.48  95.84| 78.13  74.41| 94.73  94.46| 78.52  76.90
Average      | 96.71  96.63| 84.29  83.07| 96.53  96.39| 85.97  85.59
An additional experiment was made in order to compare CCA with one of the most successful methods for combining classifiers into an ensemble, AdaBoostM1. 50 individual classifiers induced with the AdaBoostM1 method were used as the source classifiers for CCA (AdaBoostM1 CCA). As expected, the AdaBoostM1 ensemble method boosted the classification accuracy on the learning set to 100% in all cases, which cannot be said for the CCA that used the same classifiers (Table 2). A closer look, however, shows that the accuracy and average class accuracy on the learning set decreased by no more than 3.98% in the worst case when using CCA. On the other hand, the accuracy and average class accuracy on the unseen test samples increased in all cases but two (where CCA obtained the same result as AdaBoostM1) when using CCA. We can therefore observe that the slight decrease in learning accuracy probably means less overfitting and consequently better results on the test set. Accordingly, the same conclusion can be drawn as in the previous experiment: the improvement in classification accuracy is a direct consequence of using CCA as the method for combining classifiers into an ensemble.
Table 2. CCA using classifiers induced with AdaBoostM1 compared to AdaBoostM1 (Acc = accuracy, CAcc = average class accuracy)

             |        AdaBoostM1         |      AdaBoostM1 CCA
             |  Learn set  |  Test set   |  Learn set  |  Test set
Data         |  Acc   CAcc |  Acc   CAcc |  Acc   CAcc |  Acc   CAcc
australian   |100.00 100.00| 86.96  86.59| 99.78  99.77| 88.70  88.77
breast       |100.00 100.00| 97.00  97.87| 99.36  99.49| 99.00  97.87
cleve        |100.00 100.00| 77.23  77.57| 96.04  96.02| 85.15  85.49
diabetes     |100.00 100.00| 71.48  68.17|100.00 100.00| 75.39  72.26
glass        |100.00 100.00| 81.94  72.46|100.00 100.00| 81.94  79.11
golf         |100.00 100.00|100.00 100.00|100.00 100.00|100.00 100.00
heart        |100.00 100.00| 84.44  84.41| 98.33  98.27| 90.00  90.28
iris         |100.00 100.00| 92.00  92.00|100.00 100.00| 92.00  92.00
pima         |100.00 100.00| 75.39  71.76| 97.66  97.11| 79.69  76.20
Average      |100.00 100.00| 85.16  83.43| 99.02  98.96| 87.98  86.89
The last experiment was made with the MultiBoostAB ensemble learning method. 50 individual classifiers induced with MultiBoostAB were used as the source of classifiers for CCA (MultiBoostAB CCA) (Table 3). Like the AdaBoostM1 ensemble method, MultiBoostAB reached 100% classification accuracy on the learning set in all cases. MultiBoostAB CCA made a classification error on only two databases when classifying the learning samples. The results on the test sets, however, are much more interesting: MultiBoostAB CCA performed better (in 7 out of 9 cases) or at least as well as MultiBoostAB (in 2 cases) in terms of accuracy and average class accuracy on unseen test samples. The presented results are a further example of CCA's ability to improve the classification accuracy of ensembles.

Table 3. CCA using classifiers induced with MultiBoostAB compared to MultiBoostAB (Acc = accuracy, CAcc = average class accuracy)

             |       MultiBoostAB        |     MultiBoostAB CCA
             |  Learn set  |  Test set   |  Learn set  |  Test set
Data         |  Acc   CAcc |  Acc   CAcc |  Acc   CAcc |  Acc   CAcc
australian   |100.00 100.00| 86.96  86.59| 99.35  99.33| 88.26  88.59
breast       |100.00 100.00| 97.00  97.87|100.00 100.00| 97.00  97.87
cleve        |100.00 100.00| 77.23  77.57| 98.52  98.51| 83.17  84.94
diabetes     |100.00 100.00| 71.48  68.17|100.00 100.00| 75.39  72.09
glass        |100.00 100.00| 81.94  72.46|100.00 100.00| 86.11  83.33
golf         |100.00 100.00|100.00 100.00|100.00 100.00|100.00 100.00
heart        |100.00 100.00| 84.44  84.41|100.00 100.00| 87.78  88.01
iris         |100.00 100.00| 92.00  92.00|100.00 100.00| 94.00  94.23
pima         |100.00 100.00| 75.39  71.76|100.00 100.00| 78.52  75.02
Average      |100.00 100.00| 85.16  83.43| 99.76  99.76| 87.80  87.12
5 Discussion and Conclusion

In this paper we presented a new approach to combining diverse classifiers into an ensemble using the model of cellular automata. We empirically showed CCA's ability to improve classification accuracy compared to the Bagging, Boosting and MultiBoost ensemble learning methods on 9 randomly chosen databases from the UCI repository, using the same base classifiers as the compared methods but a different combination technique.
We can conclude that the improvement in classification accuracy results only from the organization ability and transaction rules of CCA. However, the CCA approach also has some drawbacks. From the computational point of view, the CCA approach needs additional power to apply the transaction rules, which can be expensive in the learning process, but its self-organizing feature can result in better classification, which can also mean lower costs. Additional advantages of the resulting self-organizing structure of cells in a CCA are problem independence, robustness to noise, and no need for additional user input. Important directions for future research are to analyze the resulting self-organized structure, the impact of the transaction rules on the classification accuracy, and the introduction of other social aspects for cell survival.
References

1. Wolpert, D., Macready, W.: No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1(1) (1997) 67-82
2. Breiman, L.: Bagging Predictors. Machine Learning 24(2) (1996) 123-140 [http://citeseer.ist.psu.edu/breiman96bagging.html]
3. Freund, Y., Schapire, R.E.: Experiments with a New Boosting Algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning, Morgan Kaufmann, San Francisco (1996) 148-156
4. Dietterich, T.G.: Ensemble Methods in Machine Learning. In: Kittler, J., Roli, F. (eds.): First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, Springer-Verlag, New York (2000) 1-15
5. Kokol, P., Povalej, P., Lenič, M., Štiglic, G.: Building Classifier Cellular Automata. In: Sloot, P.M.A., Chopard, B. (eds.): 6th International Conference on Cellular Automata for Research and Industry, ACRI 2004, Amsterdam, The Netherlands, October 25-27, 2004, Lecture Notes in Computer Science 3305, Springer-Verlag (2004) 823-830
6. Von Neumann, J.: Theory of Self-Reproducing Automata. Burks, A.W. (ed.), University of Illinois Press, Urbana and London (1966)
7. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
8. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine, CA [http://www.ics.uci.edu/~mlearn/MLRepository.html]
9. Webb, G.: MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning 40(2) (2000) 159-196
10. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)