Neural Processing Letters 6: 43–50, 1997. © 1997 Kluwer Academic Publishers. Printed in the Netherlands.


An Enhancement of Generalization Ability in Cascade Correlation Algorithm by Avoidance of Overfitting/Overtraining Problem

IGOR V. TETKO 1,2 and ALESSANDRO E.P. VILLA 1

1 Laboratoire de Neuro-heuristique, Institut de Physiologie, Université de Lausanne, Rue du Bugnon 7, Lausanne, CH-1005, Switzerland; 2 Department of Biomedical Applications, Institute of Bioorganic and Petroleum Chemistry, Murmanskaya 1, Kiev-660, 253660, Ukraine
E-mail: [email protected]

Key words: cascade correlation algorithm, early stopping, overfitting, overtraining

Abstract. The current study investigates a method for avoidance of the overfitting/overtraining problem in Artificial Neural Networks (ANN) based on a combination of two algorithms: Early Stopping and Ensemble averaging (ESE). We show that ESE improves the prediction ability of ANNs trained according to the Cascade Correlation Algorithm. A simple algorithm to estimate the generalization ability of the method according to the Leave-One-Out technique is proposed and discussed. In the accompanying paper the problem of optimal selection of training cases is considered for accelerated learning with the ESE method.

1. Introduction

Artificial Neural Networks (ANN) represent a powerful regression method. The crucial feature of this technique is its generalization ability: it is very important to design an ANN model that is able to predict new cases of a test set with reliable accuracy. An ANN with a single hidden layer (and also multiple-hidden-layer networks) can interpolate any multidimensional function with a given accuracy and can implement an arbitrary finite training set [1]. However, an increase in the number of hidden neurons can lead to the so-called overfitting/overtraining problem: the ANN starts to fit 'noise', thereby impairing its generalization accuracy. A concurrent problem that also influences the performance of an ANN is the underfitting problem: a network with only a few neurons may fail to fully detect the signal in a complex data set. Both problems are similar to the dilemma of selecting the degree of smoothing in nonparametric estimation [2]. The underfitting problem can be avoided by using algorithms that dynamically grow the number of hidden-layer neurons in the network, such as the Cascade Correlation Algorithm (CCA) [3], or by using a 'sufficiently large' network (in which the number of weights can be comparable to the number of cases in the input data set). Several possible solutions have been proposed for the overfitting problem. On one hand, the following methods aim to decrease the number of adjustable parameters in an ANN:


(i) constructive or pruning algorithms [4], which start with a large ANN and dynamically adjust the number of weights (nodes) during learning;
(ii) penalized regularization algorithms based on various weight decays [4];
(iii) specialized architectures and weight sharing [5].

On the other hand, methods have been suggested to increase the size of the training data set by including cases with added noise in the input data, for example by applying some 'finite temperature' learning [6]. Bayesian estimation has also been found very effective for ANN training [7]. However, these approaches have certain disadvantages. Some require specific knowledge about the structure of the input data sets (e.g., weight sharing), some may fall into local minima (e.g., pruning algorithms or weight-decay methods), and others may become extremely time consuming (e.g., Bayesian estimation) or sensitive to parametrization (e.g., the magnitude of constants in weight decays, or the noise level in training with noise).

An alternative approach is based on the widely accepted observation that the generalization error decreases in the early period of training, reaches a minimum and then increases, while the training error decreases monotonically. It is therefore recommended to stop the training at an optimal point in time. This technique is often referred to as early stopping [8, 9, 10].

One more problem is pertinent to the generalization ability of ANNs. Neural networks belong to the family of so-called unstable methods: small perturbations in the training sets of such methods may result in significant changes of the predicted values [11]. An ensemble average calculated over multiple predictors (e.g., bagging [11]) can significantly improve the generalization ability of such methods. A method that combines the two techniques – early stopping and ensemble averaging (hereafter referred to as 'Early Stopping over an Ensemble', ESE) – was recently proposed [10, 12] and its advantages were demonstrated for the training of ANNs with fixed-size architectures. The present study investigates an application of ESE to ANNs trained according to CCA [3].

2. Data Sets

Data sets were generated according to the general form:

    y_i \equiv y(x_i) = g(x_i) + \varepsilon_i                                  (1)

Here the x_i were generated according to a random uniform distribution in the interval [0.1, 0.9] (a standard range for ANN inputs) and \varepsilon_i is noise generated according to a normal distribution with zero mean. Three different functions g(x) were investigated:

    function #1:   g_1(x) = x                                                   (2)
    function #2:   g_2(x) = \sin 2\pi (1 - x^2)                                 (3)
    function #3:   g_3(x) = e^{-4x} - 4 e^{-8x} + 3 e^{-12x}                    (4)
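For illustration, data sets of the form of Eqns. (1)–(4) can be generated along the following lines. This is a minimal sketch assuming NumPy; the noise standard deviation (here sigma = 0.05) is an arbitrary placeholder, since the noise variance is not specified in this excerpt.

```python
import numpy as np

def g1(x):
    # function #1: linear, Eqn. (2)
    return x

def g2(x):
    # function #2, as reconstructed in Eqn. (3)
    return np.sin(2 * np.pi * (1 - x ** 2))

def g3(x):
    # function #3: exponential decay, Eqn. (4)
    return np.exp(-4 * x) - 4 * np.exp(-8 * x) + 3 * np.exp(-12 * x)

def make_data(g, n_cases, sigma, rng):
    """Generate y_i = g(x_i) + eps_i with x_i ~ U[0.1, 0.9], eps_i ~ N(0, sigma^2)."""
    x = rng.uniform(0.1, 0.9, size=n_cases)
    y = g(x) + rng.normal(0.0, sigma, size=n_cases)
    return x, y

rng = np.random.default_rng(0)
x_train, y_train = make_data(g3, 50, 0.05, rng)     # 50-case training set (sigma assumed)
x_test, y_test = make_data(g3, 1000, 0.05, rng)     # noisy test set of 1,000 cases
x_clean, y_clean = make_data(g3, 1000, 0.0, rng)    # noise-free test set (eps_i = 0)

# outputs are rescaled to the [0.1, 0.9] range before ANN training
y_min, y_max = y_train.min(), y_train.max()
y_scaled = 0.1 + 0.8 * (y_train - y_min) / (y_max - y_min)
```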


These functions (i.e., linear, asymptotic periodic and exponential decay) cover general classes of functions for one-dimensional regression analysis. Training data sets consisted of 50 cases, while two test data sets of 1,000 cases were generated for each function. The first test data sets were generated with the same level of noise as the training data sets. The second sets were generated without addition of noise (\varepsilon_i = 0). The output values y_i were normalized to the [0.1, 0.9] range before ANN training.

3. Ensemble Analysis

ANNs were trained according to CCA [3]. The number of hidden-layer neurons was restricted to 50. An ANN ensemble (ANNE) consisted of 100 networks. In order to estimate the statistical parameters, 50 ensembles (5,000 nets in total) were analyzed. For each network, the input data set of 50 cases was randomly partitioned into learning and validated data sets of equal size. All networks were randomly initialized. An early stopping point (ESP) was determined for a network when the validation error started to increase. Fewer than N_h = 20 hidden neurons were required to locate an ESP. To study the overtraining problem, the network was allowed to grow up to a total of N_h = 50 hidden neurons.
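The stopping rule just described can be sketched as the following bookkeeping over a recorded validation-error curve. This is only an illustration (the CCA training loop itself is not shown), and the patience argument is an assumption for the sketch, not part of the original method.

```python
import numpy as np

def early_stopping_point(val_errors, patience=1):
    """Return the index of the early stopping point (ESP).

    val_errors[k] is the validation error recorded after the (k+1)-th hidden
    unit has been installed.  Growth is stopped once the validation error has
    failed to improve for `patience` consecutive checks; the ESP is the index
    of the smallest validation error seen up to that moment.
    """
    best_idx, best_err, since_best = 0, val_errors[0], 0
    for k in range(1, len(val_errors)):
        if val_errors[k] < best_err:
            best_idx, best_err, since_best = k, val_errors[k], 0
        else:
            since_best += 1
            if since_best >= patience:
                break                      # validation error started to increase
    return best_idx

# usage: a validation-error curve that first decreases and then rises again
curve = np.array([0.30, 0.21, 0.15, 0.12, 0.11, 0.13, 0.16])
print(early_stopping_point(curve))         # -> 4, the minimum before the rise
```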

A simple average according to

    z_i = \frac{1}{M} \sum_{k=1}^{M} z_i^{(k)}                                  (5)

was used to calculate the prediction ability of an ANNE. The prediction of a case i is represented by an array of output values z_i^{(k)}, where k = 1, ..., M is the index of the network belonging to the ANNE and z_i is the value predicted by the ensemble of ANNs. The prediction error was defined as the mean squared error of z_i calculated over all cases of a data set.

4. Calculated Results

Six different methods of ANN training, labeled A to F, were compared.

(A) In order to exemplify the overtraining problem, all cases from the input data sets were supplied for ANN training. The overtrained networks were able to exactly reproduce the training data set (Figure 1[b], [e], thin lines), but they were characterized by a low prediction ability for new data (Table IA).

(B) Application of the ensemble technique to overtrained networks increased their predictive ability (Table IB). However, the averaged network prediction tended to reproduce the fine structure (which is generated by noise) of the data sets (Figure 1[b], [e], thick lines).

(C) The curves calculated by early stopping (Figure 1[c], [f]) could provide a good interpolation of the real functions.


However, an unsatisfactory partition of the input data set into learning and validation subsets could occur by chance. In this case, an ANN was overtrained in certain regions of the input data set (Figure 1[c], arrow). Note that such cases reduce the prediction ability of the ANN.

Figure 1. Training data sets and calculated curves. Panels [a-c] refer to function g_1(x) and panels [d-f] refer to function g_3(x). [a], [d]: circles represent the 50 cases of the training data set and the thick line represents the function g(x) over the interval [0.1, 0.9]. [b], [e]: three examples of overtrained ANNs (thin lines) and an ensemble average of 100 such ANNs (thick lines). [c], [f]: three examples of early stopping ANNs (thin lines) and an ensemble average of 100 such ANNs (thick lines).
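For concreteness, the averaging of Eqn. (5) and the prediction-error measure reported below in Table I can be sketched as follows, assuming the outputs of the M individual networks are stored in a NumPy array; the array names and placeholder data are illustrative only.

```python
import numpy as np

def ensemble_prediction(z):
    """Eqn. (5): z[k, i] is the output of network k for case i; return the
    simple average over the M networks, one value z_i per case."""
    return z.mean(axis=0)

def prediction_error(z, y_true):
    """Prediction error of the ANNE: mean squared error of the ensemble
    prediction calculated over all cases of the data set."""
    return np.mean((ensemble_prediction(z) - y_true) ** 2)

# usage with placeholder outputs of 100 networks on a 1,000-case test set
rng = np.random.default_rng(1)
y_true = rng.uniform(0.1, 0.9, size=1000)
z = y_true + rng.normal(0.0, 0.1, size=(100, 1000))   # each row: one network's predictions
print(prediction_error(z, y_true))
```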


Table I. Average and standard deviation of the prediction errors.

method  analyzed ANNs                                   g_1(x)               g_2(x)               g_3(x)
A       5,000 overtrained ANNs                          0.0906 ± 0.0048      0.0880 ± 0.0071      0.0740 ± 0.0039
                                                        (0.0673 ± 0.0065)    (0.0704 ± 0.0088)    (0.0561 ± 0.0052)
B       50 ensembles, each formed by 100                0.0832 ± 0.0008      0.0770 ± 0.0021      0.0682 ± 0.0009
        overtrained ANNs from A                         (0.0567 ± 0.0011)    (0.0558 ± 0.0029)    (0.0482 ± 0.0013)
C       5,000 early stopped ANNs                        0.0691 ± 0.0042      0.0674 ± 0.0058      0.0661 ± 0.0050
                                                        (0.0237 ± 0.0107)    (0.0398 ± 0.0098)    (0.0440 ± 0.0083)
D       100 best fitted ANNs, selected from C           0.0693 ± 0.0039      0.0643 ± 0.0033      0.0640 ± 0.0031
                                                        (0.0247 ± 0.0100)    (0.0345 ± 0.0060)    (0.0406 ± 0.0060)
E       50 ensembles of 100 nets 'ESE' from C           0.0666 ± 0.0002      0.0603 ± 0.0004      0.0617 ± 0.0004
                                                        (0.0180 ± 0.0010)    (0.0268 ± 0.0010)    (0.0375 ± 0.0008)
F       50 ensembles of 100 nets trained according      0.0748 ± 0.0005      0.0669 ± 0.0007      0.0627 ± 0.0005
        to the 'bagging' algorithm                      (0.0395 ± 0.0012)    (0.0409 ± 0.0013)    (0.0373 ± 0.0009)

A: all 50 cases from the training data set were used to train the ANNs. The prediction errors were calculated for each ANN at the end of training, when the number of hidden neurons reached N_h = 50.
B: the ANNs from example A were subdivided into 50 ensembles. The predicted values of each ensemble were calculated by Eqn. (5).
C: the 50 cases of the training data set were randomly subdivided into learning (25 cases) and validated (25 cases) data sets. The prediction errors were calculated for each ANN at its early stopping point.
D: 100 early stopped ANNs with minimal prediction errors for the initial training set were selected from example C.
E: the 5,000 early stopped ANNs from example C were subdivided into 50 ensembles.
F: training of the ANNs was done with data sets resampled according to bagging [11].
The results for test data sets generated without noise are indicated in parentheses. The averages were estimated over 50 ensembles in examples B, E and F, but over all analyzed ANNs in examples A, C and D.

(D) It is possible to select early stopping networks that give the lowest error for the initial training data set (i.e., the combination of the learning and validated data sets). A total of 5,000 early stopped networks were analyzed and the 100 networks characterized by the lowest error were selected. Table ID indicates an improvement of the prediction ability of such networks compared to the early stopping ANNs for functions g_2(x) and g_3(x), but no improvement was observed for g_1(x).

(E) The results calculated by ESE provided an improvement over the other methods (Table IE). Both the prediction error and its standard deviation were reduced.

(F) The 'bagging' method was implemented as follows [11]. Equal probabilities were attributed to each case from the initial training set T (50 cases), which was sampled with replacement (bootstrapped) 50 times in order to form the resampled training set T^(B). The initial training data set was resampled in this way for each ANN. Some cases in T did not appear in T^(B), some appeared more than once. The networks were trained and averaged as in method (B). The results were characterized by a higher prediction error and standard deviation compared to ESE, except for g_3(x).
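The resampling step of method (F) can be sketched as follows, assuming NumPy; the placeholder training set and the ensemble size are illustrative, and the subsequent CCA training and averaging (as in method B) are not shown.

```python
import numpy as np

def bootstrap_resample(x, y, rng):
    """Form a resampled training set T(B): draw len(x) cases from T with equal
    probability and with replacement, so some cases appear more than once and
    others not at all."""
    idx = rng.integers(0, len(x), size=len(x))
    return x[idx], y[idx]

rng = np.random.default_rng(2)
x_train = rng.uniform(0.1, 0.9, size=50)               # placeholder 50-case training set
y_train = x_train + rng.normal(0.0, 0.05, size=50)

# one bootstrap replicate per network of the 100-net ensemble
resampled_sets = [bootstrap_resample(x_train, y_train, rng) for _ in range(100)]
```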


Table II. Average and standard deviations of the prediction errors using the ESE algorithm.

                        g_1(x)             g_2(x)             g_3(x)
validated data sets
  – all cases           0.0649 ± 0.0005    0.0614 ± 0.0006    0.0534 ± 0.0006
  – LOO                 0.0692 ± 0.0007    0.0626 ± 0.0007    0.0589 ± 0.0007
test data sets          0.0666 ± 0.0002    0.0603 ± 0.0004    0.0617 ± 0.0004

Values calculated according to the validated data sets are used to estimate the error of the test data sets.

5. Estimation of the Prediction Ability

The error calculated for a validated data set can be used as an estimate of the prediction error for the test data set (see Table II). This estimate can be biased for small data sets: since the validation set is used during ANN training to determine the ESP, the prediction error can be underestimated. To decrease the effect of this bias, a cross-validated Leave-One-Out (LOO) estimation can be programmed as follows. One case is removed from the validated data set into a temporary data set, and an ESP is determined for this case using the remaining validation cases. This procedure is repeated iteratively for all cases and the LOO estimation of the prediction ability for the validated data set is calculated. In our computer implementation the LOO estimation decreased the speed of ANN training by only a few percent. This method provided a similar accuracy for all tested functions, whereas the prediction based on the validated data set was underestimated for g_3(x) (Table II).
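The LOO bookkeeping described above can be sketched as follows, assuming that the per-case validation errors at every growth step of a network are available in a matrix (a stand-in for quantities an actual implementation would record during training). Using the error minimum as the ESP is a simplification of the 'first increase' rule, and all names are hypothetical.

```python
import numpy as np

def loo_prediction_error(case_errors):
    """Cross-validated Leave-One-Out (LOO) estimate of the prediction error.

    case_errors[s, j] is the squared error of validation case j after the
    network has grown for s steps.  For every case j an ESP is determined from
    the mean validation error of the remaining cases (here simply its minimum),
    and the error of the held-out case at that ESP enters the LOO average.
    """
    n_steps, n_val = case_errors.shape
    loo = np.empty(n_val)
    for j in range(n_val):
        others = np.delete(case_errors, j, axis=1)     # validation set without case j
        esp = int(np.argmin(others.mean(axis=1)))      # ESP chosen without case j
        loo[j] = case_errors[esp, j]                   # held-out error at that ESP
    return loo.mean()

# usage with a placeholder error matrix: 50 growth steps, 25 validation cases
rng = np.random.default_rng(3)
case_errors = rng.uniform(0.0, 0.1, size=(50, 25))
print(loo_prediction_error(case_errors))
```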


6. Discussion

Overtrained ANNs provided the worst prediction accuracy (i.e., the highest error) in comparison with all other approaches analyzed in this study. The application of an ensemble technique to these ANNs did not significantly improve the performance. The results calculated according to early stopping neural networks were better than those produced by overtrained networks. However, it is impossible to decide which single ANN should be used as the final model, because there is no objective criterion to select the best network. A selection of the early stopping ANNs with minimal errors for the initial training data set could provide some improvement of the results. However, this approach continues the process of ANN training through the selection itself: in a set of an infinite number of early stopping networks we could find, by chance, the same networks that are produced by overtraining. Therefore the prediction ability of the selected networks can be as low as that of overtrained ones. The best prediction accuracy and the lowest variance were calculated using ESE.

The performance of ESE can be explained in the general framework of the bias/variance problem of neural networks [11, 13]. The expected error of a regression model can be decomposed as follows:

    PE(f_i) = (f_i - Y)^2 = (f^* - Y)^2 + (f^* - \bar{f})^2 + (f_i - \bar{f})^2
            = E\varepsilon^2 + Bias(f) + Var(f_i)                               (6)

where f_i is the value estimated by model i, Y is a test value generated with noise \varepsilon, f^* is the true test value generated for the function with zero noise (\varepsilon = 0), and \bar{f} is the mathematical expectation calculated over all regression models (i.e., neural networks). The term E\varepsilon^2 measures the amount of noise, or variability, of Y given x and does not depend on the training data or on the regression models. The use of an ensemble average (provided the regression models are not correlated [13]) decreases the variance Var(f_i) of the results. Ensemble-average results for test data sets without noise provide an estimate of the bias of the analyzed method. Table I illustrates that the overtrained ANNs are characterized by a significantly larger bias compared to the early stopping networks. Thus, 'early stopping' decreases the bias and 'averaging' decreases the variance of the ANNs, and the combination of both methods, i.e. ESE, provides the best prediction ability. Note that the results calculated by bagging can be subject to the problem of bias (see the results for functions g_1(x) and g_2(x)).

An improvement of prediction ability using ESE was recently reported for fixed-size ANNs [12]. There are some advantages of CCA compared to such networks. Firstly, it is not required to determine the size of the ANN prior to training. Secondly, training with CCA is faster than with other training algorithms. This algorithm can also solve some tasks (e.g. the two-spiral problem [3]) that present substantial difficulties for the training of fixed-size nets. That is why CCA is of considerable interest for practical applications of ESE.

The following question can be raised: how many nets should be used in an ensemble? A larger number of nets decreases the variance of the predicted values. Therefore, it is possible to indicate a priori, before training, the desired accuracy of the ensemble prediction, so that the actual number of neural networks required for this accuracy can be determined during the ongoing calculations.

Interestingly, our calculations showed robustness of the ESE technique to the exact determination of the ESP. The cross-validation overhead was reduced by checking the validation errors only every 10 epochs. Calculations with a validation check after every 1 or 10 epochs did not show any significant difference in the results, but the speed of calculation was increased (by 30% in our implementation).

It is worth mentioning that the ESE technique retains all the advantages of ensemble methods for parallel calculations. Each network of the ensemble operates with its own unique input data sets and can be trained completely independently from the other networks. In addition, ESE avoids the main drawback of the early stopping method, i.e. that the data from the validated data set do not participate in ANN learning due to the splitting technique.


Indeed, random resampling of the initial training data set followed by ensemble averaging uses all available data.

One problem was not addressed in this study: how many cases should be used for the validation and learning data sets. A theoretical approach to this problem was given by Amari et al. [14]. Their study indicated that for optimal stopping by cross-validation the number of cases in a test data set should be equal to 1/\sqrt{2m} \times 100\%, where m is the number of network parameters. This formula, however, cannot be applied directly to the analysis of CCA networks, in which the number of network parameters (weights) increases during training.

7. Acknowledgements

This study was supported by INTAS-UKRAINE (95-IN/UA-60) and Swiss National Science Foundation (FNRS 31-37723.93) grants. We thank Alessandra Celletti for her helpful suggestions and Elizabeth Mathew-Sluyter for the editorial assistance.

References

1. R. Hecht-Nielsen, "Kolmogorov's mapping neural network existence theorem", Proc. Int. Conf. on Neural Networks, pp. 11–14, 1987.
2. G. Wahba, "Generalization and regularization in nonlinear learning systems", in The Handbook of Brain Theory and Neural Networks, M. Arbib (ed.), MIT Press, pp. 426–430, 1995.
3. S. Fahlman and C. Lebiere, "The cascade-correlation learning architecture", NIPS, Vol. 2, pp. 524–532, 1990.
4. R. Reed, "Pruning algorithms – a survey", IEEE Trans. Neural Networks, Vol. 4, pp. 740–747, 1993.
5. Y. Le Cun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard and L. Jackel, "Handwritten digit recognition with a backpropagation network", NIPS, Vol. 2, pp. 396–404, 1990.
6. S. Bös, "Avoiding overfitting by finite temperature learning and cross-validation", Proc. ICANN'95 (Paris), Vol. 2, pp. 111–116, 1995.
7. D.J.C. MacKay, "A practical Bayesian framework for backpropagation networks", Neural Computation, Vol. 4, pp. 448–472, 1992.
8. C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press: Oxford, 1995.
9. R. Hecht-Nielsen, Neurocomputing, Addison-Wesley, 1989.
10. I.V. Tetko, D.J. Livingstone and A.I. Luik, "Neural network studies. 1. Comparison of overfitting and overtraining", J. Chem. Inf. Comput. Sci., Vol. 35, pp. 826–833, 1995.
11. L. Breiman, "Bagging predictors", Machine Learning, Vol. 24, pp. 123–140, 1996.
12. I.V. Tetko and A.E.P. Villa, "Efficient partition of learning data sets for neural network training", Neural Networks, 1997, in press.
13. S. Geman, E. Bienenstock and R. Doursat, "Neural networks and the bias/variance dilemma", Neural Computation, Vol. 4, pp. 1–58, 1992.
14. S. Amari, N. Murata, K.-R. Müller, M. Finke and H. Yang, "Asymptotic statistical theory of overtraining and cross-validation", Technical Report METR 95-06, University of Tokyo, 1995.
