Multiobjective Optimization of Ensembles of Multilayer Perceptrons for Pattern Classification

P.A. Castillo (1), M.G. Arenas (2), J.J. Merelo (1), V.M. Rivas (2) and G. Romero (1)

(1) Department of Architecture and Computer Technology, University of Granada (Spain)
(2) Department of Computer Science, University of Jaén (Spain)
e-mail: [email protected]
Abstract. Pattern classification seeks to minimize the error on unknown patterns; however, in many real-world applications, type I (false positive) and type II (false negative) errors have to be dealt with separately, which is a complex problem since an attempt to minimize one of them usually makes the other grow. Moreover, one type of error can be more important than the other, and a trade-off that minimizes the most important error type must be reached. Despite the importance of type II errors, most pattern classification methods take into account only the global classification error. In this paper we propose to optimize both error types by means of a multiobjective algorithm in which each error type and the network size is an objective of the fitness function. A modified version of the G-Prop method (optimization and design of multilayer perceptrons) is used to simultaneously optimize the network size and the type I and type II errors.

Keywords: multiobjective optimization, multiobjective evolutionary algorithms, Pareto optimality, artificial neural network optimization, G-Prop, pattern classification
1 Introduction
There are two types of errors that can be made when classifying patterns: the first is the spurious detection of a non-existent effect (type I error or false positive); the second is the non-detection of an existent effect (type II error or false negative). For instance, if we conclude from a toxicity test that the tested substance is toxic when in fact it is not, we have committed a type I error; if we conclude that a substance is not toxic when in fact it is, we have committed a type II error [14]. This exemplifies how these errors differ in the real world and usually have to be dealt with separately. Another example is bankruptcy prediction [5]: false positives can lead to lawsuits, while false negatives will usually lead only to the loss of a customer.
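To make the distinction concrete, the sketch below (ours, not part of any of the methods discussed later) computes the global, type I and type II error rates from binary labels and predictions, under the convention that label 1 marks the "positive" class (toxic, bankrupt, malignant).

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Global, type I (false positive) and type II (false negative) error rates.

    Assumes binary labels where 1 is the 'positive' class (e.g. malignant,
    bankrupt) and 0 the 'negative' class; this convention is ours.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    negatives = (y_true == 0)
    positives = (y_true == 1)

    false_positives = np.sum((y_pred == 1) & negatives)  # non-existent effect "detected"
    false_negatives = np.sum((y_pred == 0) & positives)  # existent effect missed

    type_I  = false_positives / max(negatives.sum(), 1)  # rate over true negatives
    type_II = false_negatives / max(positives.sum(), 1)  # rate over true positives
    global_error = np.mean(y_true != y_pred)
    return global_error, type_I, type_II

# Example: 1 false positive and 2 false negatives over 10 patterns
print(error_rates([0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                  [0, 1, 0, 0, 0, 1, 1, 1, 0, 0]))   # (0.3, 0.2, 0.4)
```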
The most commonly used statistical measure, statistical significance, is a measure of the type I error only (the absence of a significant effect does not mean that the effect does not exist). Neyman-Pearson's test, on the other hand, is based on the minimization of the type II error after establishing a limit on the type I error [31, 6]. There are many examples of incorrect conclusions due to type II errors [30, 25], and caution must be taken not to interpret "no evidence of effect" as "evidence of no effect" [28]. Despite the importance of the type II error in some real problems (medical problems, such as judging whether a person is sick [21], or forecasting business failure [5]), most methods compute the global classification error without distinguishing between the two types.

Most real optimization problems present several objective variables to be optimized at the same time. Solving multiobjective problems in which some objectives conflict is a complex task: instead of searching for a single solution, the method must obtain a set of solutions, from which one (or several) will later be used to solve the problem. Multiobjective optimization methods range from the weighted sum of the objectives, which nowadays is usually not considered the best method available, to techniques based on the Pareto front [11, 12, 22, 33].

This paper continues the research presented in [26, 27], where a comparison between statistical methods (logistic regression) and evolutionary artificial neural networks (G-Prop) for bankruptcy prediction was carried out. Discovering when a company is going to fail is a problem traditionally approached heuristically; it requires a wide knowledge of the company, so it can only be carried out by accounting experts. The interest in pinpointing a firm's financial state of health starts with its management, who can thus count on early warning signs to aid them in taking decisions to correct foreseeable financial distress. Similar considerations apply to medical problems, such as the prediction of cancer or diabetes.

G-Prop [7, 9] was designed to optimize both the architecture of the multilayer perceptron (MLP) and its classification ability. In the present paper we propose to improve the type I and type II errors using a multiobjective evolutionary algorithm (called MG-Prop) that optimizes both errors and also the MLP architecture. Our target is to design small networks that produce small type I and type II errors. Approaching this problem as a multiobjective optimization task makes it unnecessary to weight the different objectives, and solutions based on Pareto optimality guarantee the diversity of the final population. As a mixture of predictors usually improves the results, our method uses the final Pareto front to form an ensemble. The multiobjective problem thus deals with three objectives: minimizing the type I error, the type II error, and the network size. To test the ability of the proposed method, two real pattern classification problems have been used (bankruptcy prediction and breast cancer).

The remainder of this paper is structured as follows: Section 2 reviews the literature on multiobjective optimization of artificial neural networks (ANNs). Section 3 details the proposed method. Section 4 describes the experiments and the results obtained, followed by a brief conclusion in Section 5.
2 Neural network optimization using multi-objective methods
Evolving neural networks [34, 7, 10] is an efficient way of searching the neural network problem space. However, most of these methods either compute a weighted sum of the objectives (e.g. minimizing both the errors and the number of weights) or assign a higher priority to one of them [7, 10]. Moreover, ANNs act as black boxes that simply output a prediction for an input pattern, which can be a problem for the expert who uses an ANN-based method, owing to the difficulty of explaining how the method computed its output.

To face the problem of optimizing all the parameters and objectives related to an ANN, some authors propose the use of multiobjective optimization methods, such as the Pareto differential evolution (PDE) algorithm by Abbass [4]. This algorithm is an adaptation of the original differential evolution introduced by Storn and Price [32], and was also tested successfully for evolving neural networks (Memetic Pareto Artificial Neural Network [1, 3]). Recently, Abbass cast the simultaneous optimization of the network architecture and the corresponding training error as a multiobjective optimization problem [3], finding that combining backpropagation (BP) with a multiobjective evolutionary algorithm achieves a considerable reduction in computational cost. In later research, Abbass [2] proposed two multiobjective formulations for the formation of neuro-ensembles: the first splits the training set into two non-overlapping stratified subsets and forms an objective by minimizing the training error on each subset; the second adds random noise to the training set to form a second objective. Jin et al. [17] presented a modified version of two multiobjective optimization algorithms (Dynamic Weighted Aggregation and NSGA-II) to address neural network regularization from a multiobjective point of view (both the structure and the parameters of the neural network are optimized).

In recent years, ANN research has focused on neural network ensembles [23], because they are a powerful tool for facing complex problems. Using an ensemble (mixture of predictors) usually results in an improvement in prediction accuracy [16]. Many studies [18-20, 29] focus on the construction of ensembles of artificial neural networks. Some authors establish the set of networks that form the ensemble by means of evolutionary [19, 20], coevolutionary [15], and multiobjective methods, as Abbass proposes in [3] (using the Pareto-front networks as the ensemble components). Although most of these methods consider only the global classification error, we think that both type I and type II errors should be taken into account.
3 MG-Prop: Multiobjective Evolution of MLPs
In this paper, a multiobjective method designed to optimize both the type I and type II errors and the network size is proposed. These objectives may conflict, so improving one of them can make another worse.
The method is based on the multiobjective evolutionary algorithm SFGA (Single Front Genetic Algorithm), developed by de Toro et al., and uses the elitist selection scheme proposed in [13]: instead of copying the full elite set (non-dominated individuals) into the next population, a reduced set of individuals distributed homogeneously over the search space is copied. However, this algorithm had to be adapted to our particular problem. After some experimentation we found that, as the evolution advances, the number of non-dominated MLP individuals increases considerably. Moreover, some networks may commit no error at all on class A patterns (0%) while failing on every class B pattern (100%). Such infeasible individuals are non-dominated; nevertheless, they should be eliminated. The population of individuals selected to mate and generate the next generation is formed by those in the Pareto front. The algorithm is specified in the following pseudocode, which is an adaptation of the previously published G-Prop algorithm [7, 9]:
1. Generate a population of N individuals (Population).
2. Evaluate the N individuals: train them using the training set and obtain their fitness (type I and type II errors on the validation set, and the network size).
3. Repeat for G generations:
   (a) Copy Population into P0.
   (b) Remove Pareto-dominated individuals from Population (obtaining P1).
   (c) Remove infeasible individuals from P1 (obtaining P2).
   (d) Prune P2 using clustering to limit the number of individuals.
   (e) Select S individuals from P2 and apply genetic operators to copies of them (obtaining Offspring).
   (f) Evaluate the new individuals in Offspring.
   (g) Fill the Population using the non-dominated individuals (P2) and Offspring.
   (h) If N is greater than the number of individuals in Population, use dominated individuals from P0 to fill Population.
4. Remove Pareto-dominated and infeasible individuals from Population.
5. Use the non-dominated individuals (MLPs) as an ensemble to classify the test set and to obtain the total, type I and type II errors.
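Steps (b)-(d) do most of the work in each generation. The following sketch is our own minimal rendering of that filtering, assuming each individual is summarized by the objective tuple (type I error, type II error, number of weights), all minimized; the clustering prune of step (d) is replaced here by a simple greedy crowding rule, so it is an illustration rather than the exact SFGA procedure.

```python
import math

def dominates(a, b):
    """a Pareto-dominates b when it is no worse in every objective and strictly
    better in at least one (type I error, type II error, network size; all minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def feasible(obj):
    """Discard degenerate MLPs that ignore one class: perfect on one error type,
    total failure on the other (the 0% / 100% case described above)."""
    type_I, type_II, _size = obj
    return not ((type_I == 0.0 and type_II >= 100.0) or
                (type_II == 0.0 and type_I >= 100.0))

def pareto_step(population, objectives, max_front=20):
    """Steps (b)-(d): keep feasible non-dominated individuals, then prune the front
    so it stays below max_front (a greedy stand-in for the clustering of step (d))."""
    front = [(ind, obj) for ind, obj in zip(population, objectives)
             if feasible(obj)
             and not any(dominates(other, obj) for other in objectives)]
    # Greedy prune: repeatedly drop the most crowded individual (the one whose
    # nearest neighbour in objective space is closest).
    while len(front) > max_front:
        def nearest_neighbour_distance(i):
            return min(math.dist(front[i][1], front[j][1])
                       for j in range(len(front)) if j != i)
        front.pop(min(range(len(front)), key=nearest_neighbour_distance))
    return front
```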
On termination, the method takes a set of previously unseen patterns and classifies them, obtaining both error types and the network size. The networks in the Pareto set are potentially good; however, since a single MLP with good performance on the training set may not be the best network in terms of generalization, we use all the MLP individuals in the Pareto front as an ensemble to classify the test patterns, as proposed in [2]. We have used three methods to obtain the ensemble classification: (a) the predicted class is obtained by majority voting; (b) for each pattern, the class is the one corresponding to the largest activation among all the outputs of all the networks; (c) the outputs of all networks are averaged. Thus, an expert who has to decide whether a person is ill (or a company is bankrupt) has more information available about that case than if a single ANN were used.
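The three combination rules can be written compactly. The sketch below is our own reading of (a)-(c), assuming each network exposes its per-class output activations as an array; it is not the actual G-Prop/MG-Prop code.

```python
import numpy as np

def combine(ensemble_outputs, rule="voting"):
    """Combine per-network class activations into one predicted class per pattern.

    ensemble_outputs: array of shape (n_networks, n_patterns, n_classes)
    rule: 'voting'  -- (a) each network votes for its own argmax class;
          'largest' -- (b) take the class of the single largest activation
                           over all networks;
          'average' -- (c) average the activations over networks, then argmax.
    """
    outputs = np.asarray(ensemble_outputs)
    if rule == "voting":
        votes = outputs.argmax(axis=2)                     # (n_networks, n_patterns)
        n_classes = outputs.shape[2]
        counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
        return counts.argmax(axis=0)                       # most voted class per pattern
    if rule == "largest":
        best_net = outputs.max(axis=2).argmax(axis=0)      # network with the strongest activation
        return outputs.argmax(axis=2)[best_net, np.arange(outputs.shape[1])]
    if rule == "average":
        return outputs.mean(axis=0).argmax(axis=1)
    raise ValueError(f"unknown rule: {rule}")
```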
4 Experiments and results
Following Prechelt's advice [24], at least two real-world problems should be used to test an algorithm. In any case, the best way to test an algorithm's ability, as well as its limitations, is to use it to solve real-world problems.
Thus, we have tested our method on two real-world classification problems: the well-known breast cancer problem and a bankruptcy benchmark problem (see below for details).

4.1 Breast cancer problem
The dataset comes from the UCI machine learning repository's Wisconsin breast cancer database (http://www.ics.uci.edu/~mlearn/MLSummary.html), which was compiled from the University of Wisconsin Hospitals and Clinics in Madison by Dr. William H. Wolberg [21]. Each sample has 10 attributes plus the class attribute: sample code number, clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses, and class (0 for benign, 1 for malignant). The class distribution in the original set is 65.5% benign and 34.5% malignant.
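For readers who want to reproduce the setup, the following sketch loads a local copy of the raw UCI data file; the filename, the use of pandas, the handling of missing values, and the mapping of the raw class codes (2 = benign, 4 = malignant) to the 0/1 convention used above are our assumptions, not part of the original experiments.

```python
import pandas as pd

# Assumed local copy of the UCI "breast-cancer-wisconsin.data" file; column names
# follow the attribute list given above.
columns = ["sample_code", "clump_thickness", "cell_size_uniformity",
           "cell_shape_uniformity", "marginal_adhesion", "epithelial_cell_size",
           "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

data = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?")
data["class"] = data["class"].map({2: 0, 4: 1})   # 0 benign, 1 malignant (our mapping)

X = data.drop(columns=["sample_code", "class"])    # the nine cytological features
y = data["class"]

print(y.value_counts(normalize=True))              # roughly 65.5% benign / 34.5% malignant
```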
4.2 Bankruptcy problem
The dataset comes from the Infotel database (http://www.axesor.com). The sample contains 450 non-financial firms taken from the Infotel database, of which half are companies that have failed and the other half are in good financial health. A further 76 firms, not included in the aforementioned group, have been used as a validation sample for the models obtained. The group of financial failures corresponds to firms that had suspended payments or had declared legal bankruptcy, in accordance with Spanish law, while the healthy firms were randomly selected from among 150,000 companies. The comparison was made using the same sample and the same time period: 1998 and 1999. The dependent variable takes the value 1 in the case of legal failure and 0 in the case of a healthy firm. The independent variables are quantitative ratios taken from financial statements, along with qualitative information; their description is given in Table 1.
4.3 Methodology
Experiments were run using 50 generations, a population size of 50 individuals, and 500 training epochs, in order to avoid too long a run. The remaining parameters (selection rate, operator application rates, mutation probability, initial weight range, etc.) were set using statistical methods (see [10] for details). The means and standard deviations shown in the tables (see Section 4.4) were obtained after 30 runs of each method. Student's t-tests are used to evaluate the difference in means, as they can be applied even to small samples. The test calculates a value p that represents the probability of observing a difference as large as the measured one if the null hypothesis (no difference between the populations) were true. Thus, Student's t-tests were used to check the validity of the results obtained (mean and standard deviation) and to test whether the differences among them were significant.
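As an illustration of the statistical check described above (not the authors' actual analysis scripts), the sketch below compares two made-up samples of 30 per-run errors with a two-sample t-test from SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-run type II errors (%) over 30 runs of two methods.
errors_method_a = rng.normal(loc=1.3, scale=0.1, size=30)
errors_method_b = rng.normal(loc=0.7, scale=0.6, size=30)

t_stat, p_value = stats.ttest_ind(errors_method_a, errors_method_b, equal_var=False)

# p is the probability of observing a difference this large if the means were
# actually equal; p < 0.01 corresponds to the 99% confidence level quoted below.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("significant at 99%" if p_value < 0.01 else "not significant at 99%")
```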
VARIABLE                             DESCRIPTION
L: Liabilities                       Liabilities / Equity
DE: Debt Expiration                  Long-Term Liabilities / Current Liabilities
WC: Working Capital                  Working Capital / Total Assets
CashR: Cash Ratio                    Cash Equivalent / Current Liabilities
AT: Acid Test                        (Cash Equivalent + Marketable Securities + Net Receivables) / Current Liabilities
CR: Current Ratio                    Current Assets / Current Liabilities
DR: Debt Ratio                       Total Assets / Total Liabilities
DPA: Debt Paying Ability             Operating Cash Flow / Total Liabilities
AT: Asset Turnover                   Net Sales / Average Total Assets
ST: Stock Turnover                   Cost of Sales / Average Inventory
RT: Receivable Turnover              Net Sales / Average Receivables
ROPA: Return on Operating Assets     Operating Income / Average Operating Assets
OIM: Operating Income Margin         Operating Income / Net Sales
ROA: Return on Assets                Net Income / Average Total Assets
ROE: Return on Equity                Net Income / Average Total Equity
DC: Debt Cost                        Interest Cost / Total Liabilities
IC: Interest Cost                    Interest Cost / Sales
LAG: Date                            Time lag in reporting annual accounts
L: Lawsuits                          Sums challenged / Total Liabilities
I: Incidences                        Relative to auctions, impounds, etc.

Table 1. Description of independent variables.
4.4 Results obtained

Method         Global error   Type I error   Type II error   Network size
G-Prop [8]     1.2 ± 0.1      1.1 ± 0.1      1.3 ± 0.1       173 ± 22
MG-Prop (V)    1.2 ± 0.5      1.3 ± 0.9      0.7 ± 0.6       ensemble of
MG-Prop (A)    1.4 ± 0.4      1.9 ± 0.8      0.5 ± 0.5       15 ± 1 MLPs of
MG-Prop (B)    1.6 ± 0.4      2.0 ± 0.9      0.7 ± 0.8       120 ± 44 weights

Table 2. Average results for the Breast Cancer problem: global error rate, type I and type II errors, and network size, obtained using G-Prop [8] and MG-Prop. Majority voting (V), average output (A) and largest activation (B) have been used to obtain the ensemble classification. The network size is expressed as the number of parameters (weights) of the net. In the case of MG-Prop an ensemble is obtained, so both the number of components of the ensemble and the average number of weights of each MLP in the ensemble are reported (the size entry spans the three MG-Prop rows).
Results on the Breast Cancer problem are shown in Table 2. As can be seen, the multiobjective method obtains results comparable to those obtained using G-Prop (which prioritizes one objective). Although MG-Prop does not outperform G-Prop, a slightly better global classification error is achieved, and the difference between the type I and type II errors decreases (the error types become more homogeneous). Moreover, better results are obtained using majority voting to form the ensemble classification. As far as network size is concerned, since the ensemble is composed of several MLPs, the total number of weights is higher than that obtained using G-Prop; however, the individual MLPs are slightly smaller than those obtained using G-Prop.

Student's t-tests were used to verify the results. No significant differences were found between G-Prop and MG-Prop in the global classification error or the type I error; however, the difference in the type II error was significant at the 99% level. In any case, Cancer is a problem on which G-Prop already classifies most of the patterns correctly (small type I and type II errors are committed), which makes it difficult to find differences between models.

Table 3 shows the results obtained on the Bankruptcy problem using logistic regression, G-Prop and MG-Prop.

Method            Global error   Type I error   Type II error   Network size
Logit [26, 27]    15.92          17.24          14.48           -
G-Prop [26, 27]   17.26          11.21          23.32           1042
MG-Prop (V)       12 ± 2         13 ± 2         12 ± 3          ensemble of
MG-Prop (A)       13 ± 2         13 ± 3         13 ± 2          16 ± 1 MLPs of
MG-Prop (B)       15 ± 3         14 ± 3         16 ± 4          1120 ± 60 weights

Table 3. Average results for the Bankruptcy problem using logistic regression (logit), G-Prop [26, 27] and MG-Prop.
Previous research [26, 27] showed that the forecasting ability of multilayer perceptrons is slightly lower than that of logistic regression in type II and total error, although better results are obtained in type I error. In G-Prop, however, the fitness measure used to evolve the perceptrons takes into account only the total error, not the type I and/or type II errors. The results obtained using MG-Prop are slightly better than those obtained with the previous methods [26, 27]. As in the previous problem, the differences between type I and type II errors, as well as the global error, decrease, and better results are obtained using majority voting to form the ensemble classification. Again, G-Prop finds a single MLP, while MG-Prop finds an ensemble (several MLPs); thus the total number of weights is much higher for MG-Prop, although the sizes of the individual MLPs are comparable. After applying Student's t-tests to verify the results, the differences between the methods in terms of the errors obtained were significant at the 99% confidence level.
5 Conclusions and Work in Progress
In this paper we address the optimization of MLP classification ability as a multiobjective optimization problem in which the type I error, the type II error and the network size are minimized. Since a single MLP may not be the best network in terms of generalization, we use the MLPs in the Pareto front as an ensemble to increase the generalization ability. The Pareto set defines the size of the ensemble and ensures that the networks in the ensemble are different and contribute to optimizing the objectives. MG-Prop takes the error types into account and obtains results comparable to (or even better than) those of the G-Prop method (which prioritizes some objectives), handling the trade-off between objectives: it optimizes the error types while obtaining similar network sizes.

The method has been tested on two real pattern classification problems, breast cancer and bankruptcy, and has been found to be competitive with other methods in the literature. Although the results are not much better than those obtained using other methods, a slightly better global classification error is achieved and the difference between the type I and type II errors decreases. In general, majority voting seems to work best for combining the ensemble on this kind of problem. In any case, the idea of optimizing the type I and type II errors and the network size as a multiobjective problem is interesting and could be applied to improve other classification methods.

In the bankruptcy problem, the model has been obtained for one year before the failure; as future work, it would be interesting to obtain it for two or more years prior to the declaration of bankruptcy. On the other hand, Zhou et al. [35] analyzed the relationship between an ensemble and its component neural networks, obtaining better results by ensembling some of the networks instead of all of them. Taking these results into account, it would be interesting to carry out the selection of components using automatic methods, such as cooperative models.
6 Acknowledgements
This work has been supported by the CICYT TIC2003-09481-C04-04 project.
References

1. H.A. Abbass. A memetic Pareto evolutionary approach to artificial neural networks. In M. Stumptner, D. Corbett and M. Brooks, editors, Proceedings of the 14th Australian Joint Conference on Artificial Intelligence (AI'01), pages 1-12, Berlin, 2001.
2. H.A. Abbass. Pareto neuro-evolution: constructive ensemble of neural networks using multi-objective optimization. In IEEE Congress on Evolutionary Computation (CEC2003), IEEE Press, vol. 3, pp. 2074-2080, Canberra, Australia, 2003.
3. H.A. Abbass. Speeding up back-propagation using multiobjective evolutionary algorithms. Neural Computation, MIT Press, vol. 15, no. 11, pp. 2705-2726, 2003.
4. H.A. Abbass, R.A. Sarker, and C.S. Newton. PDE: a Pareto-frontier differential evolution approach for multi-objective optimization problems. In Proc. of the IEEE Congress on Evolutionary Computation (CEC2001), vol. 2, pp. 971-978, Piscataway, NJ, IEEE Press, 2001.
5. E.I. Altman. The success of business failure prediction models: an international survey. Journal of Banking, Accounting and Finance, 8:171-198, 1984.
6. R.J. Barton. Neyman-Pearson hypothesis testing. ECE 6333: Signal Detection and Estimation. http://www2.egr.uh.edu/~rbarton/Classes/ECE6333/files.pdf/Notes/Notes 919-05.pdf, 2005.
7. P.A. Castillo, J. Carpio, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. Evolving multilayer perceptrons. Neural Processing Letters, vol. 12, no. 2, pp. 115-127, October 2000.
8. P.A. Castillo, J. González, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. G-Prop-III: global optimization of multilayer perceptrons using an evolutionary algorithm. In Genetic and Evolutionary Computation Conference, ISBN 1-55860-611-4, volume I, p. 942, Orlando, USA, 1999.
9. P.A. Castillo, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. G-Prop: global optimization of multilayer perceptrons using GAs. Neurocomputing, vol. 35/1-4, pp. 149-163, 2000.
10. P.A. Castillo, J.J. Merelo, G. Romero, A. Prieto, and I. Rojas. Statistical analysis of the parameters of a neuro-genetic algorithm. IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1374-1394, ISSN 1045-9227, November 2002.
11. C.A. Coello Coello, D.A. Van Veldhuizen, and G.B. Lamont. Evolutionary Algorithms for Solving Multi-Objective Problems. Kluwer Academic Publishers, New York, ISBN 0-3064-6762-3, 2002.
12. C.A. Coello Coello and N. Cruz Cortés. Solving multiobjective optimization problems using an artificial immune system. Genetic Programming and Evolvable Machines, vol. 6, no. 2, pp. 163-190, 2005.
13. F. de Toro, J. Ortega, J. Fernandez, and A.F. Diaz. Parallel genetic algorithm for multiobjective optimization. In 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, IEEE Computer Society, pp. 384-391, 2002.
14. J.A. Freiman, T.C. Chalmers, and H. Smith. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. New England Journal of Medicine, 299:690-694, 1978.
15. N. Garcia-Pedrajas, C. Hervas-Martinez, and D. Ortiz-Boyer. Cooperative coevolution of artificial neural network ensembles for pattern classification. IEEE Transactions on Evolutionary Computation, 9(3):271-302, 2005.
16. H. Ishibuchi and T. Yamamoto. Evolutionary multiobjective optimization for generating an ensemble of fuzzy rule-based classifiers. In Proc. of the Genetic and Evolutionary Computation Conference (GECCO2003), Lecture Notes in Computer Science (LNCS), pp. 1077-1088, Chicago, IL, 2003.
17. Y. Jin, T. Okabe, and B. Sendhoff. Neural network regularization and ensembling using multi-objective evolutionary algorithms. In Congress on Evolutionary Computation (CEC2004), vol. 1, pp. 1-8, ISBN 0-7803-8515-2, 2004.
18. A. Krogh and J. Vedelsby. Neural network ensembles, cross validation and active learning. In G. Tesauro, D.S. Touretzky and T.K. Leen, editors, Advances in Neural Information Processing Systems, vol. 7, pp. 231-238. MIT Press, 1995.
19. Y. Liu and X. Yao. Ensemble learning via negative correlation. Neural Networks, 12(10):1399-1404, 1999.
20. Y. Liu, X. Yao, and T. Higuchi. Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 4(4):380-387, 2000.
21. O.L. Mangasarian, R. Setiono, and W.H. Wolberg. Pattern recognition via linear programming: theory and application to medical diagnosis. In T.F. Coleman and Y. Li, editors, Large-Scale Numerical Optimization, SIAM Publications, Philadelphia, pp. 22-30, 1990.
22. J. Olvander. Robustness considerations in multi-objective optimal design. Journal of Engineering Design, vol. 16, no. 5, pp. 511-523, 2005.
23. M.P. Perrone and L.N. Cooper. When networks disagree: ensemble methods for hybrid neural networks. In R.J. Mammone, editor, Neural Networks for Speech and Image Processing, pp. 126-142, 1993.
24. L. Prechelt. PROBEN1 - a set of benchmarks and benchmarking rules for neural network training algorithms. Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, September 1994.
25. C. Rolf, T.G. Cooper, C.H. Yeung, and E. Nieschlag. Antioxidant treatment of patients with asthenozoospermia or moderate oligoasthenozoospermia with high-dose vitamin C and vitamin E: a randomized, placebo-controlled, double-blind study. Human Reproduction, 14:1028-1033, 1999.
26. I. Roman, J.M. de la Torre, P.A. Castillo, and J.J. Merelo. Sectorial bankruptcy prediction analysis using artificial neural networks: the Spanish companies case. In 25th Annual Congress of the European Accounting Association, p. 237, Copenhagen, April 2002.
27. I. Roman, J.M. de la Torre, M.E. Gomez, P.A. Castillo, and J.J. Merelo. Bankruptcy prediction adapted to firm characteristics: an empirical study. In 26th Annual Congress of the European Accounting Association, Congress Book, p. A-108, Sevilla, April 2003.
28. J. Savulescu, I. Chalmers, and J. Blunt. Are research ethics committees behaving unethically? Some suggestions for improving performance and accountability. British Medical Journal, 313:1390-1393, 1996.
29. A.J.C. Sharkey. On combining artificial neural nets. Connection Science, 8:299-313, 1996.
30. S.M. Smith. Statistical scrotal effect. Nature, 368:501-502, 1994.
31. J.A.C. Sterne. Teaching hypothesis tests - time for significant change? Statistics in Medicine, 21:985-994 (DOI: 10.1002/sim.1129), 2002.
32. R. Storn and K. Price. Differential evolution: a simple and efficient adaptive scheme for global optimization over continuous spaces. Technical Report TR-95-012, International Computer Science Institute, Berkeley, 1995.
33. D.A. Van Veldhuizen and G.B. Lamont. Multiobjective evolutionary algorithms: analyzing the state-of-the-art. Evolutionary Computation, 8(2):125-147, 2000.
34. X. Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423-1447, 1999.
35. Z.H. Zhou, J. Wu, and W. Tang. Ensembling neural networks: many could be better than all. Artificial Intelligence, vol. 137, no. 1-2, pp. 239-253, 2002.