Using Bagging and Cross-Validation to improve ensembles based on Penalty terms

Joaquín Torres-Sospedra, Carlos Hernández-Espinosa, and Mercedes Fernández-Redondo

Department of Computer Science and Engineering, Universitat Jaume I
Avda. Sos Baynat s/n, CP E-12071, Castellón, Spain.
{jtorres, espinosa, redondo}@icc.uji.es
Abstract. Decorrelated and CELS are two ensembles that modify the learning procedure to increase the diversity among the networks of the ensemble. Although they provide good performance according to previous comparatives, they are not as well known as other alternatives, such as Bagging and Boosting, which modify the learning set in order to obtain classifiers with high performance. In this paper, two different procedures are applied to Decorrelated and CELS in order to modify the learning set of each individual network and improve their accuracy. The results show that these two ensembles are improved by using the two proposed methodologies as specific set generators.
Keywords: Ensembles with Penalty Terms, Specific Sets, Bagging, CVC
1 Introduction
One technique used to generate classifiers consists in training a set of different neural networks (an ensemble). According to the literature [11], this procedure increases the generalization capability when the networks are not correlated. Among the alternatives to generate ensembles, Bagging [2], Boosting [4] and Cross-Validation Committee [7] are well known and provide good performance [3, 5]. These ensembles modify the learning set to improve the accuracy of the ensemble. However, there are other ensembles, such as Decorrelated and CELS, which also provide good results [3, 5] but are less used in the literature.

In this paper, two procedures to generate specific learning sets for Decorrelated and CELS are introduced. In the first one, Bagging is used to randomly generate the specific training and validation sets for training each network of the ensemble. In the second one, an advanced version of Cross-Validation Committee is used to perform the partitioning task. The second partitioning procedure was successfully applied to Adaboost in [9], so we consider that Decorrelated and CELS can also be improved by the procedures proposed in this paper.

This paper is organized as follows. In Section 2, the learning process of a neural network is briefly analyzed, and Decorrelated and CELS are reviewed. In Section 3, the proposed partitioning methodologies are described. The experimental setup is shown in Section 4, whereas the results are discussed in Section 5.
2 Theoretical background

2.1 Learning process and stopping criteria
The network architecture employed in this paper is the Multilayer Feedforward network, henceforth called MF network. In the experiments, the networks have been trained for a predefined number of iterations. In each iteration, the weights of the networks have been adapted with the Backpropagation algorithm by using all the patterns from the training set, T. At the end of the iteration, the Mean Square Error, MSE, has been calculated by classifying all the patterns from the validation set, V. When the learning process has finished, the weights of the iteration with the lowest MSE on the validation set are assigned to the final network.
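The stopping criterion above can be summarized with a short sketch. The following Python fragment is a minimal illustration, not the authors' implementation; the `train_epoch` and `validation_mse` callables are hypothetical placeholders for one Backpropagation pass over T and the MSE computation on V.

```python
import copy

def train_with_early_stopping(net, train_epoch, validation_mse, epochs):
    """Sketch of the stopping criterion described above (hypothetical
    interface): train_epoch(net) runs one Backpropagation pass over the
    training set T, validation_mse(net) returns the MSE on the validation
    set V. The weights of the iteration with the lowest validation MSE
    are kept as the final network."""
    best_mse = float("inf")
    best_net = copy.deepcopy(net)
    for _ in range(epochs):
        train_epoch(net)            # adapt weights with all patterns of T
        mse = validation_mse(net)   # classify all patterns of V
        if mse < best_mse:          # remember the best iteration so far
            best_mse = mse
            best_net = copy.deepcopy(net)
    return best_net                 # final network = lowest validation MSE
```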
2.2 Description of ensemble methodologies
Decorrelated: Two versions of Decorrelated, DECOv1 and DECOv2, were introduced by Rosen in [8]. In both versions, the networks are trained serially and the main purpose is to penalize an individual classifier for being correlated with the previously trained one. For this reason, Rosen added a penalty term (P) to the MSE equation (E), as denoted in Eq. (1):

E^n(x) = \sum_{c=1}^{N_{cls}} \left[ \frac{1}{2} \left( d_c(x) - y_c^n(x) \right)^2 + P_c^n(x) \right] \qquad (1)
Where n stands for the number of network in the ensemble and c for the output class. The Penalty (Eq. 2) denotes the correlation degree between a network and the previously trained one. ( ) Pcn (x) = λ · dc (x) − ycn−1 (x) · (dc (x) − ycn (x)) (2) Where λ denotes the weight of the penalty term which must be set empirically by trial and error because it depends on the classification problem. The networks are trained independently but the equations used in Backpropagation to update the weights of the MF networks have to be adapted to the new error equation. Although both versions use the same penalty, DECOv1 applies it to all the networks whereas DECOv2 only introduces the penalty in the odd networks. CELS: Cooperative Ensemble Learning System (CELS ) is another ensemble variant that modifies the target equation and, therefore, the learning algorithm [6]. In this ensemble, all the networks of the ensemble are trained in parallel. Although the error is calculated with Eq.1, the penalty is given by: Pcn (x) = λ · (ycn (x) − dc (x)) ·
N∑ nets
(
yci (x) − dc (x)
)
i=1 i̸=n
Where λ is the weight of the penalty and it must be empirically set.
(3)
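To make the two penalties concrete, the following sketch evaluates Eqs. (1)-(3) for one pattern with numpy. It illustrates the formulas only; the vector names (d for the target vector, outputs for the per-network output vectors) are assumptions of this sketch, not identifiers from the original work.

```python
import numpy as np

def decorrelated_penalty(d, y_n, y_prev, lam):
    """Penalty of Eq. (2): correlation of network n with network n-1.
    d, y_n and y_prev are vectors with one entry per output class."""
    return lam * (d - y_prev) * (d - y_n)

def cels_penalty(d, outputs, n, lam):
    """Penalty of Eq. (3): correlation of network n with the remaining
    networks; outputs is a list with the output vector of every network."""
    others = sum(outputs[i] - d for i in range(len(outputs)) if i != n)
    return lam * (outputs[n] - d) * others

def penalized_error(d, outputs, n, penalty):
    """Error of Eq. (1) for network n: quadratic error plus penalty,
    summed over the output classes."""
    return np.sum(0.5 * (d - outputs[n]) ** 2 + penalty)
```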
3 Creating new specific sets for penalty-based ensembles

3.1 Combining penalties and specific sets
The ensembles Decorrelated and CELS modify the learning process by adding a penalty term to the target equation used for minimization. However, all the networks are trained using the same training and validation sets.

To perform the experiments, the original datasets, described in Section 4.2, have been divided into three different subsets. The first set is the training set, T (64% of the total patterns), which is used to adapt the weights of the networks. The second set is the validation set, V (16% of the total patterns), which is used to select the network configuration with the best estimated generalization capability. Finally, the last set is the test set, TS (20% of the total patterns), which is used to measure the accuracy of the network and obtain the final results. When we refer to the original learning set, L, we refer to the union of the training and validation sets.

In this paper, two different procedures (based on Bagging and Cross-Validation) are introduced to generate different training and validation sets for each network of these penalty-based ensembles. The diversity of the ensemble might be positively affected by these procedures because the networks will not use exactly the same training set (used to adapt the weight values) and validation set (used to select the best network configuration with patterns not used for training). Moreover, the new ensembles will benefit from the penalty terms introduced in Decorrelated and CELS, whose aim is to reduce the correlation among the networks.
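As an illustration of the three-way split described above, the following sketch returns index arrays for T, V and TS. The exact shuffling scheme is an assumption of this sketch; the paper only fixes the 64/16/20 proportions.

```python
import numpy as np

def split_dataset(n_patterns, seed=0):
    """Sketch of the partitioning used in the experiments: 64% training
    (T), 16% validation (V) and 20% test (TS), returned as index arrays.
    L = T u V is the original learning set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_patterns)
    n_train = int(0.64 * n_patterns)   # T: used to adapt the weights
    n_val = int(0.16 * n_patterns)     # V: used to select the best configuration
    T = idx[:n_train]
    V = idx[n_train:n_train + n_val]
    TS = idx[n_train + n_val:]         # TS: used only to measure accuracy
    return T, V, TS
```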
3.2 Bagging as set generator
Although Bootstrap Aggregating, henceforth Bagging, is an ensemble model proposed by Breiman in [2], it can also be used to generate specific training and validation sets. Concretely, the specific training set of each network is generated by randomly sampling patterns with replacement from the original learning set. According to reference [11], the generated training sets should double the size of the original training set (size factor n = 2). In this paper, a size factor of 1.5 is also tested. The patterns from the original learning set, L, which are not in the specific training set, T^net, are chosen as the patterns of the specific validation set, V^net; a sketch of this set-generation step is shown below.

The basic Decorrelated algorithm is modified according to Algorithm 1. The main difference between the original Decorrelated and the new BagDecorrelated is the inclusion of the first and second statements of the for loop, which were not in the original version and have been included in BagDecorrelated to generate the specific training and validation sets according to Bagging.

All the networks are trained simultaneously in the original CELS algorithm. For this reason, the original ensemble has been adapted to use specific training and validation sets. Concretely, two versions are introduced in Algorithms 2 and 3.
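A minimal sketch of the Bagging-based set generation (the first two statements of the for loop in Algorithm 1) could look as follows; the function name and the index-array representation are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def bagging_specific_sets(L, base_size, factor_n=2.0, seed=0):
    """Sketch: T^net is drawn from the learning set L by sampling with
    replacement until it holds factor_n * base_size patterns, where
    base_size is the size of the original training set T; V^net collects
    the patterns of L that were never drawn."""
    rng = np.random.default_rng(seed)
    L = np.asarray(L)
    T_net = rng.choice(L, size=int(factor_n * base_size), replace=True)
    V_net = np.setdiff1d(L, T_net)   # patterns of L not present in T^net
    return T_net, V_net
```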
Algorithm 1 BagDecorrelated {T, V, Nnetworks}
for net = 1 to Nnetworks do
  Randomly create T^net by sampling from L with replacement
  Generate V^net with the patterns from L which are not in T^net
  Set the random seed for weight initialization to seed_net
  MF Network Training {T^net, V^net}
end for
Algorithm 2 BagCELS-m1 {T, V, Nnetworks}
for net = 1 to Nnetworks do
  Set initial weight values for network net
  Randomly create T^net by sampling from L with replacement
  Generate V^net with the patterns from L which are not in T^net
end for
for e = 1 to epochs do
  for i = 1 to n · Nlearning do
    Select x as the i-th element of the learning set L
    for net = 1 to Nnetworks do
      Calculate output y^net(x)
    end for
    for net = 1 to Nnetworks do
      if x is in T^net then
        Adjust the trainable parameters
      end if
    end for
  end for
  for net = 1 to Nnetworks do
    Calculate MSE of network net with V^net
  end for
  Save ensemble configuration and MSE
end for
for net = 1 to Nnetworks do
  Select the epoch with the lowest validation MSE
  Assign the selected epoch configuration to net
  Save final network
end for
In the first version, BagCELS-m1, the networks are trained with the original learning set. In each epoch, all the patterns from L are presented to all the networks of the ensemble. First, the output for an individual pattern, x, is calculated for all the networks of the ensemble. Then, the weights of a network, net, are adapted only if the pattern x is in its specific training set, T^net; otherwise, the weights are kept unchanged. Finally, when all the patterns from the learning set have been presented to the networks, the MSE is calculated for each network on the corresponding specific validation set, V^net.
Algorithm 3 BagCELS-m2 {T, V, Nnetworks}
for net = 1 to Nnetworks do
  Set the random generator seed to seed_net
  Randomly create T^net by sampling from L with replacement
  Generate V^net with the patterns from L which are not in T^net
end for
for e = 1 to epochs do
  for i = 1 to n · Npatterns do
    for net = 1 to Nnetworks do
      Select x as the i-th element of the training set T^net
      for net2 = 1 to Nnetworks do
        Calculate output y^net2(x)
      end for
      Adjust the trainable parameters of net
    end for
  end for
  for net = 1 to Nnetworks do
    Calculate MSE of network net with V^net
  end for
  Save ensemble configuration and MSE
end for
for net = 1 to Nnetworks do
  Select the epoch with the lowest validation MSE
  Assign the selected epoch configuration to net
  Save final network
end for
In contrast, in the second adaptation (BagCELS-m2) each network is trained with its own specific training set. For each pattern index, i, the i-th pattern of the specific training set is used to adapt the weights of each network, net. To adapt these weights, the outputs of the other networks on that pattern, the i-th element of T^net, have to be calculated. Finally, the MSE is calculated for each network on the corresponding specific validation set, as done in BagCELS-m1.
3.3 Cross-Validation Committee as set generator
Similarly, the ensemble model Cross-Validation Committee can also be used to generate the specific training and validation sets. In this case, the original learning set is divided into N_sets subsets, L_i in Eq. (4), and the specific training and validation sets are generated from them. The training set of network net is T^net and its validation set is V^net, according to the following equations:

T^{net} = \bigcup_{\substack{i=1 \\ i \neq index_{net,1} \\ i \neq index_{net,2}}}^{N_{sets}} L_i \qquad V^{net} = L_{index_{net,1}} \cup L_{index_{net,2}} \qquad (4)

where the indexes related to a neural network, index_{net,1} and index_{net,2}, are randomly set with the constraint that the different networks have different training and validation sets. The value of N_sets has been set to 10 in order to keep the size of the original training and validation sets. The base structure of CVCv3Decorrelated, CVCv3CELS-m1 and CVCv3CELS-m2 also corresponds to Algorithms 1-3, but the specific sets are generated according to CVCv3 instead of Bagging.
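A minimal sketch of this CVCv3 partitioning, under the assumption that sets are represented as index arrays, is given below. With N_sets = 10 there are 45 distinct subset pairs, enough to give each of up to 40 networks a different validation pair.

```python
import numpy as np
from itertools import combinations

def cvc_v3_specific_sets(L, n_networks, n_sets=10, seed=0):
    """Sketch of the CVCv3 partitioning of Eq. (4): L is split into
    n_sets disjoint subsets; each network reserves two randomly chosen
    subsets as V^net and trains on the remaining ones, with no two
    networks sharing the same pair of validation subsets."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(np.asarray(L)), n_sets)
    pairs = list(combinations(range(n_sets), 2))  # 45 pairs for n_sets=10
    order = rng.permutation(len(pairs))           # a different pair per network
    sets = []
    for k in order[:n_networks]:
        i, j = pairs[k]
        V_net = np.concatenate([subsets[i], subsets[j]])
        T_net = np.concatenate([subsets[m] for m in range(n_sets)
                                if m not in (i, j)])
        sets.append((T_net, V_net))
    return sets
```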
4 Experimental Setup

4.1 Experiments
To test the performance of the proposed methods, ensembles of 3, 9, 20 and 40 networks have been built for the original ensembles and for the new BagDecorrelated, BagCELS, CVCv3Decorrelated and CVCv3CELS. The experiments have been repeated ten times with different partitions of the sets.
4.2 Description of the databases
The following problems from the UCI repository of machine learning databases [1] have been used to test the performance of the methods: Arrhythmia, Balance Scale, Cylinder Bands, BUPA Liver Disorders, Australian Credit Approval, Dermatology, Ecoli, Solar Flares, Glass Identification, Heart Disease, Image Segmentation, Ionosphere, the Monk's Problems 1 and 2, Pima Indians Diabetes, Haberman's Survival, Congressional Voting Records, Vowel and Wisconsin Breast Cancer. The optimal training parameters for these datasets are not included due to lack of space, but they are publicly available in a Ph.D. thesis [10].
5 Results and Discussion

5.1 General measurements
To perform an exhaustive comparison, the mean Percentage of Error Reduction with respect to the Single Network (mean PER) has been calculated across all databases to characterize the general behavior of the ensembles. A PER value of 0% means that the ensemble method does not improve the percentage of correctly classified patterns with respect to a single network, a positive value means that the ensemble is better than the single network, and a negative value means that a single network performs better than the ensemble. The PER value is given by Eq. (5):

PER = 100 \cdot \frac{Perf_{Ensemble} - Perf_{SingleNet}}{100 - Perf_{SingleNet}} \qquad (5)

where Perf_{SingleNet} and Perf_{Ensemble} correspond to the percentage of patterns correctly classified by the single network and the ensemble, respectively.
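Eq. (5) translates directly into code. The following snippet computes the PER and illustrates it with a worked example: improving accuracy from 90% to 92% reduces the remaining error from 10% to 8%, i.e., a PER of 20%.

```python
def per(perf_ensemble, perf_single_net):
    """Percentage of Error Reduction of Eq. (5); both arguments are
    percentages of correctly classified patterns."""
    return 100.0 * (perf_ensemble - perf_single_net) / (100.0 - perf_single_net)

# A single network at 90% and an ensemble at 92% reduce the remaining
# error from 10% to 8%, so PER = 100 * 2 / 10 = 20%.
print(per(92.0, 90.0))  # -> 20.0
```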
5.2 General results
The main results for the new ensembles are introduced in Tables 1 and 2. Firstly, both versions of DECO and CELS have been improved by using Bagging as generator of the specific training and validation sets, according to Table 1.

Table 1. Mean PER - Bagging as set generator

ensemble     n    3-Net  9-Net  20-Net  40-Net
DECOv1       -    24.73  26.63  26.84   27.09
BagDECOv1    1.5  22.22  28.33  28.54   28.92
BagDECOv1    2    24.80  26.92  29.62   29.19
DECOv2       -    24.91  25.73  25.93   26.40
BagDECOv2    1.5  22.26  26.99  28.18   28.96
BagDECOv2    2    25.71  28.79  29.25   29.63
CELS         -    21.51  23.73  25.75   26.35
BagCELS-m1   1.5  21.84  27.16  26.71   24.39
BagCELS-m1   2    21.71  27.24  29.00   28.20
BagCELS-m2   1.5  21.76  25.66  26.70   24.75
BagCELS-m2   2    24.61  28.15  30.16   28.25
Secondly, the best results of BagDecorrelated are obtained when n is set to 2. For BagCELS-m1 the best value of n depends on the ensemble size: n equal to 1.5 for 3 networks and n equal to 2 for the other three cases. The best value of the factor n for BagCELS-m2 is 2. Thirdly, the best results are provided by BagDECOv2 for ensembles of 3, 9 and 40 networks; for 20 networks, the best approach is BagCELS-m2. Furthermore, the "worst" traditional ensemble is CELS, but BagCELS-m2 provides the best overall results. Generating specific sets should therefore be seriously considered.

Table 2. Mean PER - Cross-Validation Committee v3 as set generator

ensemble       3-Net  9-Net  20-Net  40-Net
CVCv3DECOv1    24.64  29.07  28.84   29.20
CVCv3DECOv2    24.42  28.25  29.79   29.77
CVCv3CELS-m1   25.31  27.32  28.23   27.52
CVCv3CELS-m2   23.71  26.32  27.33   27.70
According to Table 2, the new ensembles based on CVCv3 generally improve the results of the original ensembles. CVCv3CELS-m1 is the best ensemble for 3 networks, whereas CVCv3DECOv1 (for 9 networks) and CVCv3DECOv2 (for 20 and 40 networks) are better choices for larger ensembles. Secondly, CVCv3DECOv1 is better than CVCv3DECOv2 for 3 and 9 networks, but CVCv3DECOv2 with 20 networks provides the best overall results.
Finally, CVCv3CELS-m1 and CVCv3CELS-m2 are better than the original CELS in all cases. Moreover, CVCv3CELS-m1 is more suitable for ensembles of 3 to 20 networks, whereas CVCv3CELS-m2 fits better for 40 networks.
6 Conclusions
Some traditional ensembles (DECOv1, DECOv2 and CELS) have been successfully fused with Bagging and Cross-Validation Committee. In general, the original ensembles have been improved by using specific training and validation sets to train each network of the ensemble. In fact, the worst traditional ensemble was CELS, but the best overall results are provided by BagCELS-m2. Moreover, the new methods outperform the best results of the traditional ensembles with fewer networks, which can be useful when computational resources are critical.

Of the two alternatives to generate the specific sets, Bagging provides better results than CVCv3 in 62.5% of the cases and provides the best overall results (20 networks and BagCELS-m2). However, there are specific cases in which CVCv3 is more suitable. For this reason, both procedures should be seriously considered for use with traditional ensemble methods based on penalties.
References
1. Asuncion, A., Newman, D.: UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences (2007)
2. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
3. Fernández-Redondo, M., Hernández-Espinosa, C., Torres-Sospedra, J.: Multilayer feedforward ensembles for classification problems. In: ICONIP 2004 Proceedings. LNCS, vol. 3316, pp. 744–749. Springer (2004)
4. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: International Conference on Machine Learning, pp. 148–156 (1996)
5. Hernández-Espinosa, C., Torres-Sospedra, J., Fernández-Redondo, M.: New experiments on ensembles of multilayer feedforward for classification problems. In: IJCNN 2005 Proceedings, pp. 1120–1124 (2005)
6. Liu, Y., Yao, X.: Simultaneous training of negatively correlated neural networks in an ensemble. IEEE Transactions on Systems, Man, and Cybernetics, Part B 29(6), 716–725 (1999)
7. Parmanto, B., Munro, P.W., Doyle, H.R.: Improving committee diagnosis with resampling techniques. In: Advances in Neural Information Processing Systems, pp. 882–888 (1996)
8. Rosen, B.E.: Ensemble learning using decorrelated neural networks. Connection Science 8(3–4), 373–384 (1996)
9. Torres-Sospedra, J., Hernández-Espinosa, C., Fernández-Redondo, M.: Adaptive boosting: Dividing the learning set to increase the diversity and performance of the ensemble. In: ICONIP 2006. LNCS, vol. 4232, pp. 688–697. Springer (2006)
10. Torres-Sospedra, J.: Ensembles of Artificial Neural Networks: Analysis and Development of Design Methods. Ph.D. thesis, Universitat Jaume I (2011)
11. Tumer, K., Ghosh, J.: Error correlation and error reduction in ensemble classifiers. Connection Science 8(3–4), 385–403 (1996)
12. Yildiz, O.T., Alpaydin, E.: Ordering and finding the best of k > 2 supervised learning algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(3), 392–402 (2006)