Developing FFNN Applications Using Cross–Validated Validation Training

Jeffrey C.H. Yeh 1,2    Leonard G.C. Hamey 1,2    Tas Westcott 3

1 Cooperative Research Centre for International Food Manufacture and Packaging Science, PO Box 218, Hawthorn, Victoria 3122, Australia
2 Department of Computing, Macquarie University, NSW 2109, Australia. Email: [email protected]
3 Westcott Consultants Pty Ltd, PO Box 334, Greenacre, NSW 2190, Australia

Abstract – In this paper, we present a novel, effective, and reliable training technique for feed–forward neural networks (FFNN). We call it cross–validated validation training (CVVT) since it combines statistical cross–validation with the validation training technique used in FFNNs. CVVT improves the generalisation estimation of validation training, enabling reliable comparison and selection of network architectures. Since it utilises validation training, CVVT also preserves the generalisation performance of FFNNs with excess weights. These benefits are demonstrated using statistical analysis of real–life results from a bake inspection system. Contrary to previous work, we found that significant excess weights may actually degrade the generalisation–preserving ability of validation training.

I. INTRODUCTION

This paper presents cross–validated validation training (CVVT), a novel training technique for feed–forward neural networks. CVVT improves the reliability of generalisation performance estimation, and enables the comparison and selection of network architectures. Our statistical analysis of real–life results also provides an unbiased evaluation of validation training’s ability to preserve generalisation and prevent over–training. We found that, contrary to previous work, this ability of the validation training technique may deteriorate with too many excess weights.

When training FFNNs with back–propagation (BP), the networks’ accuracy, generalisation and training duration are affected by many parameters, including the training technique used. One training technique suggested in recent years is early–stopping training [5] or cross–validation training [4][5][6][7]. We call it validation training in this paper to avoid confusion with statistical cross–validation [8][9].

Statistical cross–validation is efficient and effective for obtaining an unbiased performance estimate of an optimised system [8], such as a trained FFNN, from limited samples. Hence statistical cross–validation has been used in some neural network research to improve the accuracy and the legitimacy of the experimental results [9]. Its concept is simple: use one or more data sets for optimisation, and a separate but related data set, independent of the optimisation process, for estimating the unbiased true performance of the optimised system. A K–fold cross–validation improves the estimate’s accuracy and reliability by dividing the samples into K subgroups. After the optimisation process using K–1 subgroups, the performance estimate is measured on the remaining subgroup. This is repeated for all K subgroups, and the average of the K estimates is more reliable than one single estimate.

Validation training [4][5][6][7] is a modification of statistical cross–validation. Whereas statistical cross–validation is applied to the optimised system, validation training is applied throughout the training process. To optimise the generalisation performance, validation training uses two data sets: a training set and a validation set. A network is trained on the training set, and periodically the network’s generalisation performance is estimated by evaluating the validation error on the independent validation set. After the training run, the weight configuration with the minimum validation error is chosen, as these weights should produce the best generalisation performance. However, since the validation error is used to choose the weights and is no longer independent of the optimisation process, the validation error is a biased estimate of the true generalisation performance. A third data set, the test set, is thus reserved before training for evaluating the test error, an unbiased estimate of the true generalisation performance of the chosen network.

Validation training is believed to preserve generalisation even when the network architecture is excessively large – a condition which otherwise results in over–training. This is because validation training has the implicit benefit of network architecture selection. By choosing the network weights at the minimum validation error point after training, that is, by stopping training early, excess weights are either trained to duplicate the function of other weights, or remain insignificantly small. These excess weights are thus effectively inactive, making validation training a type of regularisation [4][5][6].
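The weight–selection rule of validation training can be sketched as follows. This is a minimal illustration on a toy least–squares problem, not the paper’s Aspirin/FFNN setup; the data, model, and all names are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: targets are a noisy linear function of the inputs.
X = rng.normal(size=(120, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=120)

# Training / validation split (a test set would be held out before this).
X_tr, y_tr = X[:80], y[:80]
X_va, y_va = X[80:], y[80:]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = rng.uniform(-0.1, 0.1, size=5)            # small random initial weights
best_w, best_val = w.copy(), mse(w, X_va, y_va)

for epoch in range(200):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.1 * grad                           # plain gradient descent step
    val = mse(w, X_va, y_va)                  # periodic validation check
    if val < best_val:                        # keep the weights with the
        best_val, best_w = val, w.copy()      # minimum validation error

# best_w is the "early-stopped" network; best_val is now a biased estimate,
# so true generalisation would be measured on the separate test set.
```

Because the final weights `w` may already be past the validation minimum, the checkpointed `best_w` is the configuration validation training returns.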
In other words, larger networks trained by validation training will produce similar generalisation performance to the optimal topology, and network size is no longer a concern for FFNNs. However, these claims are based entirely on theoretical and simulation work on artificial data sets [4][5][6][7], some of which are biased according to [5][9]. It is preferable to have an empirical evaluation of theories using real–life data [9]. The only real–life example [7] that we have encountered is not reliable, because the analysis was based on one measurement instead of repeated measurements and lacks the statistical analysis recommended in [9].

Our proposed CVVT combines statistical cross–validation with validation training. Since statistical cross–validation, especially K–fold cross–validation [8][10][11], improves the accuracy of the generalisation estimation, and hence the reliability of the validation training results, suitable statistical tests on these results provide a reliable evaluation of the effectiveness of validation training. Furthermore, the repeated measurements from our CVVT technique also enable a reliable comparison and selection of network architectures. As our experiment utilises real–life data, it also provides an unbiased real–life empirical evaluation of the benefits of validation training.
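The K–fold rotation underlying CVVT can be sketched as index bookkeeping. The helper name `cvvt_splits` is our own; the paper’s 200–sample, 5–fold experiment is used only as the example size:

```python
import numpy as np

def cvvt_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for a K-fold CVVT rotation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # random division into subgroups
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]                    # one subgroup validates ...
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx              # ... the other K-1 train

# Example: 200 samples for validation training, 5 folds of 40.
splits = list(cvvt_splits(200, 5))
# Each of the K runs yields one trained network, one validation error, and
# one test error; the K test errors are averaged, and the K networks can be
# combined into an ensemble by averaging their outputs.
```

Each sample serves as validation data in exactly one of the K runs, which is what lets CVVT extract more information from the same data than a single fixed split.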

II. CROSS–VALIDATED VALIDATION TRAINING

CVVT applies K–fold statistical cross–validation to validation training, and inherits the benefits of both techniques. The available samples are divided randomly into two groups: one for validation training, and a test set for true generalisation estimation after the validation training. The first group is divided into K subgroups for a K–fold cross–validation. For each of the K validation training runs, the subgroups are permuted, with K–1 subgroups as the training set and the remaining subgroup as the validation set. After each training run, the network weights with the minimum validation error are chosen as the optimal weights (as in validation training), and the network’s generalisation performance is finally estimated on the test set.

This process produces K trained networks, K validation errors, and K test errors. The average of the K validation errors may be used to compare different network architectures, providing reliable network architecture analysis and selection. The K corresponding test errors are still independent of the weight and architecture optimisation process, so their average remains a reliable and unbiased estimate of the true generalisation error of the particular network weights and architecture used. Finally, the K networks can be combined into one larger network by averaging the output values of the networks, creating an ensemble [12].

CVVT includes a trade–off between accuracy and resources. The more subgroups there are, the more accurate the estimation is, but the more computation is required. Conventional validation training produces only one optimal network, with one validation error and one test error. The validation error may be biased and thus reduce the reliability of validation training, since the network chosen may be optimal for that particular validation set but not generalise well. Similarly, the single test error measurement has a high variance and thus low reliability as an estimate of the true generalisation performance of the network during its operation. Increasing the number of subgroups increases the accuracy and reliability of the estimation, but a K–fold CVVT requires K times the resources of normal validation training. A reasonable K is between 5 and 10, as suggested in [10][11].

III. COLOUR BAKE INSPECTION SYSTEM

Inspection of the colour of baked products is important to the baking industry, since colour is related to the taste, texture, and aroma of the products [13]. Many manufacturers today employ human experts to visually inspect baked goods. This is inefficient and ineffective, as human judgements are prone to short– and long–term variations, resulting in inconsistent product quality. A better alternative for routine visual quality inspection is a machine inspection system using digital imaging technology and computer software for decision making. We have developed a colour bake inspection system using colour imaging technology and a hybrid neural network, consisting of a self–organising map (SOM) and a feed–forward neural network (FFNN), for assessing the bake colour of biscuits [1][2][3], as shown in fig. 1. We use the CVVT technique to train our FFNN. The biscuit samples were imaged using a digital colour imaging system. By plotting the red, green, and blue (RGB) values of each image pixel in an RGB colour cube, Sung [14] discovered that during baking, the bake colour of each biscuit type changes along its own unique trajectory, a baking curve, in the colour cube. Since a baking curve contains the characteristics of the bake colour, we use a one–dimensional SOM to extract a line (the baking line) along the baking curve for each product. Each image’s pixels are then histogrammed along the baking line. Histogram buckets are placed along the line, and a Gaussian weighting

Fig. 1 : The bake inspection system: A) shows the system during training, and B) shows the trained system in operation.

function is used to compute the histogram based on the pixel–to–bucket distances. The resultant colour histograms are down–sampled into dimensions matching the required input dimensions of the FFNNs. The FFNNs are then trained with the colour histograms as the inputs, and with averaged bake grades from 10 repeated independent assessments by human experts as the targets. We studied two types of biscuits in this experiment. Three hundred biscuit samples of different bake levels were collected for each biscuit product. The samples were imaged, calibrated, and segmented, and the resultant biscuit pixels were used to train the SOM to produce a baking line. The calibrated but unsegmented images were then histogrammed along the baking line into 300 colour histograms. Because of our interest in the effectiveness of validation training in preserving generalisation in networks with excess weights, we tested different network architectures to determine the minimum architecture required and the performance of larger networks. We reserved 100 samples for the test set, and divided the remaining 200 samples into 5 sets of 40 for a 5–fold CVVT. Using the Aspirin simulator [15], we trained all networks with the same settings. A small learning rate of 0.1 was used to ensure a continuous descent on the error surface. We also used a large momentum of 0.95 to speed up training and to stabilise the training when approaching the error minimum. Using the same randomisation seed for a fair comparison, the networks were initialised with small weights from a random uniform distribution of –0.1 to 0.1 inclusive to ensure convergence during training. All nodes had a sigmoid activation function with output from 0.0 to 1.0, allowing non–linear fitting. We used only one hidden layer where necessary, since one hidden layer is sufficient for universal function approximation and multiple hidden layers only improve the training speed [4].
There were no direct connections between non–adjacent layers, and all targets were scaled to the range of 0.1 to 0.9 for easier training. Fig. 2 shows the training and validation errors for one of the 5 validation training runs. The training error decreases as training continues, but the validation error drops to a minimum, marked by the circle in the graph, and then increases again as the network becomes over–trained and its generalisation performance worsens. The network with the minimum validation error is finally chosen for the best generalisation performance.
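The Gaussian–weighted histogramming described in Section III can be sketched as below, assuming each biscuit pixel has already been projected to a scalar position along the baking line; the bucket count, Gaussian width, and data are our own illustrative choices:

```python
import numpy as np

def gaussian_histogram(positions, n_buckets=45, sigma=None):
    """Soft histogram: each pixel contributes to every bucket with a
    Gaussian weight based on its distance to the bucket centre."""
    centres = np.linspace(positions.min(), positions.max(), n_buckets)
    if sigma is None:
        sigma = centres[1] - centres[0]       # width ~ bucket spacing
    # Pixel-to-bucket distances, shape (n_pixels, n_buckets).
    d = positions[:, None] - centres[None, :]
    weights = np.exp(-0.5 * (d / sigma) ** 2)
    hist = weights.sum(axis=0)
    return hist / hist.sum()                  # normalised colour histogram

# Example: synthetic pixel positions along the baking line for one image.
rng = np.random.default_rng(1)
positions = rng.beta(2, 5, size=10_000)       # skewed, like uneven bake levels
hist = gaussian_histogram(positions)
# The histogram can then be down-sampled to match the FFNN input dimension.
```

The Gaussian weighting smooths the histogram, so nearby bake levels share mass between adjacent buckets instead of producing hard bin boundaries.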

Fig. 2 : The training and validation error curves during training.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

We compare the effect of validation training on different architectures. For each biscuit product, we tested 9 different network input dimensions (I = 3, 4, 6, 8, 11, 16, 23, 32, 45) and 7 different hidden dimensions (H = 0, 1, 2, 3, 4, 6, 8), a total of 63 different architectures. Each architecture produced 5 validation measurements (V). We used an I*H*V factorial design experiment and the analysis of variance test (ANOVA) [16][17] for testing the effects of the different input and hidden dimensions (the I effect and H effect), their interaction (the I*H effect), and the effect of the different validation sets (the V effect) on the validation errors.

We found a significant interaction effect in the ANOVA test of the raw data. This is also evidenced by the multiplicative effect of the number of input and hidden nodes on the validation errors in fig. 3. We found that the interaction effect could be removed by taking logarithms of the validation errors. This made the I and H effects independent of each other, a condition necessary for subsequent analysis with the Scheffe test to determine the optimal architecture. The ANOVA results for product A are shown in table I, a standard ANOVA table in which significant test results are marked with asterisks. The table shows a very significant I effect (p=0.0), meaning that the validation error of at least one input dimension is significantly lower than that of the other input dimensions tested. Similarly, the H effect is also very significant. The V effect is also significant, reflecting the differences between the individual validation sets used in the cross–validation.

Since the interaction effect (I*H) is not significant, the Scheffe test can be used to determine the best input and hidden dimensions [16][17]. The Scheffe test results for input dimensions are shown in table II, where significant results are also marked with asterisks. The table shows that 3 input nodes is significantly different from all the other input dimensions tested. Since 3 input nodes has a higher validation error mean, as shown in the table, we conclude that the minimum input dimension for product A is 4. Similarly, when we apply the Scheffe test to the hidden dimensions, we find that the minimum hidden dimension is 2. These statistical tests suggest that, at the 95% confidence level, architectures of at least 4–2–1 are not significantly different in their validation errors (i.e. their estimated generalisation performances). This demonstrates that validation training is effective in preserving generalisation performance even for architectures with excess weights, since the larger networks perform similarly to the minimal architecture. This can also be observed by plotting the validation error means of the architectures as in fig. 3. The plot shows that the validation errors for 3 input nodes are consistently higher than those of the other architectures. It can also be seen that the validation errors drop rapidly as the number of hidden nodes increases to 2, then level out for 4 or more hidden nodes. However, the plot shows that as the number of hidden nodes increases, the variability in the error means of different architectures starts to increase again, indicating a reduction of the generalisation–preserving benefit of validation training.

Fig. 3 : The validation error means of the different architectures.
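The role of the log transform can be illustrated with a toy two–factor table (synthetic numbers, not the paper’s data): when the cell means are multiplicative in the two factors, the interaction sum of squares vanishes after taking logarithms:

```python
import numpy as np

def interaction_ss(cells):
    """Interaction sum of squares for a two-factor table of cell means."""
    grand = cells.mean()
    row = cells.mean(axis=1, keepdims=True)   # factor-1 (e.g. I) level means
    col = cells.mean(axis=0, keepdims=True)   # factor-2 (e.g. H) level means
    # Interaction residual: cell - row effect - column effect + grand mean.
    return float(((cells - row - col + grand) ** 2).sum())

# Purely multiplicative cell means: error(i, j) = a_i * b_j.
a = np.array([1.0, 2.0, 4.0])
b = np.array([1.0, 3.0])
raw = np.outer(a, b)

ss_raw = interaction_ss(raw)          # large: multiplicative => interaction
ss_log = interaction_ss(np.log(raw))  # ~0: logs make the effects additive
```

Since log(a_i * b_j) = log a_i + log b_j, the log–transformed table is exactly additive in the two factors, which is why the I*H interaction disappears and the Scheffe test becomes applicable.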
Therefore, there is a trade–off between the complexity of an FFNN and its generalisation performance – a larger topology is known to be able to produce a more complex function, but too many excess weights may degrade the generalisation–preserving ability of validation training. It is thus preferable to choose a network architecture moderately larger than the minimum required architecture determined by CVVT. This enables a generic architecture suitable for all product types, including those more difficult products requiring a more complex network, while validation training remains reliable in preserving its generalisation. From the results for product A, which has an evenly–baked surface, we chose an architecture of 8–4–1, moderately larger than its minimum architecture of 4–2–1. This larger architecture can then be applied to all future systems for other baked biscuit products without repeating the architecture selection process. We later experimented with a more difficult biscuit product, B, which has an unevenly baked surface, and found that its minimum architecture is 4–3–1, and that larger networks, including the 8–4–1 architecture, perform as well as the minimum 4–3–1 architecture. This demonstrates the reliable architecture selection benefit of the CVVT technique.

TABLE I: ANOVA OF THE LOG–TRANSFORMED VALIDATION RMS ERRORS FROM THE COLOUR BAKE INSPECTION SYSTEM.

ANOVA (colour system): Log–transformed Validation Errors
Effect | df Effect | MS Effect | df Error | MS Error | F      | p–level
I      | 8         | 0.116     | 248      | 0.009    | 12.34  | 0.000 *
H      | 6         | 0.365     | 248      | 0.009    | 38.85  | 0.000 *
V      | 4         | 1.702     | 248      | 0.009    | 181.40 | 0.000 *
I*H    | 48        | 0.008     | 248      | 0.009    | 0.86   | 0.722

TABLE II: RESULTS OF THE SCHEFFE TEST ON INPUT DIMENSIONS TESTED FOR THE COLOUR BAKE INSPECTION SYSTEM. DATA ARE LOG–TRANSFORMED BEFORE THE TEST.

Scheffe Test: H0 = means of two input dimensions are equal.
Reject H0 if Scheffe value >= Critical Value (0.092 @ 5% significance).

I_Node | Mean (org) |   3   |   4   |   6   |   8   |  11   |  16   |  23   |  32
3      |   .038     |
4      |   .034     | .103*
6      |   .032     | .149* | .046
8      |   .031     | .186* | .083  | .037
11     |   .031     | .182* | .078  | .032  | .005
16     |   .032     | .170* | .067  | .021  | .015  | .011
23     |   .032     | .157* | .054  | .008  | .029  | .024  | .013
32     |   .032     | .158* | .055  | .008  | .029  | .024  | .012  | .000
45     |   .032     | .145* | .042  | .004  | .041  | .037  | .025  | .012  | .013

V. IMPLICATIONS AND DISCUSSION

The K–fold CVVT technique integrates the benefits of the statistical K–fold cross–validation and validation training techniques. It can be practically applied to train FFNN applications to preserve generalisation and to increase the reliability of validation training, and hence of the generalisation estimation. The repeated training runs in the CVVT technique require more computational resources than the conventional validation training technique. However, given the same amount of data as the validation training technique, CVVT retrieves more information by permuting the training and validation sets, and effectively estimates the effect of training on all available training data. By using factorial design and the ANOVA and Scheffe tests, K–fold CVVT can also reliably and efficiently compare the effects of different network architectures and other network parameters, such as learning rates and momentum values, providing improved network parameter selection. Developers can use this benefit to determine the minimum architecture required for a particular application. A moderately larger architecture can then be conveniently chosen to develop networks for other related circumstances.

VI. CONCLUSION

In this paper, we present a novel, effective, and reliable training technique called cross–validated validation training. CVVT improves the reliability of conventional validation training, and is effective and efficient for comparing networks with different network parameters to determine the optimal settings. We successfully demonstrate CVVT’s benefits using statistical analysis of the data from a real–life colour bake inspection system. CVVT allows us to efficiently determine a suitable architecture for a product category instead of for each product. We have also shown that validation training is effective in preserving generalisation but, contrary to previous work, found that a significantly larger architecture may reduce this ability.

VII. ACKNOWLEDGMENT

We would like to thank Arnotts Biscuits Ltd. for their continuing support of this research project, especially with regard to providing data, technical facilities, and financial support.

VIII. REFERENCES

[1] J.C.H. Yeh, L.G.C. Hamey, T. Westcott and S.K.Y. Sung, "Colour Bake Inspection Using Hybrid Artificial Neural Networks", in Proceedings of the 1995 IEEE International Conference on Neural Networks, vol. 1, pp. 37–42.
[2] C.T. Westcott and L.G.C. Hamey, Data Recognition System, Patent Specification PCT/AU95/00813, Arnott's Biscuits Limited, 1995.
[3] J.C.H. Yeh, Colour Bake Inspection Using Artificial Neural Networks, M.Sc.(Hons) Thesis, Macquarie University, Sydney, Australia, 1997.
[4] M.H. Hassoun, Fundamentals of Artificial Neural Networks, MIT Press, 1995, pp. 197–234.
[5] W.S. Sarle, "Stopped Training and Other Remedies for Overfitting", in Proceedings of the 27th Symposium on the Interface, 1995.

[6] J. Sjoberg and L. Ljung, Overtraining, Regularization, and Searching for Minimum in Neural Networks, Technical Report LiTH–ISY–I–1297, Linkoping University, Sweden, 1992.
[7] N. Morgan and H. Bourlard, "Generalization and Parameter Estimation in Feedforward Nets: Some Experiments", Advances in Neural Information Processing Systems 2, Morgan Kaufmann Publishers, 1990, pp. 630–637.
[8] F. Mosteller and J.W. Tukey, Data Analysis and Regression: A Second Course in Statistics, Addison–Wesley Publishing Company, 1977, pp. 36–40, 133–163.
[9] A. Flexer, Statistical Evaluation of Neural Network Experiments: Minimum Requirements and Current Practice, Technical Report oefai–tr–95–16, Austrian Research Institute for Artificial Intelligence, Vienna, Austria, 1995.
[10] J. Utans and J. Moody, "Selecting Neural Network Architectures via the Prediction Risk: Application to Corporate Bond Rating Prediction", in Proceedings of the First International Conference on Artificial Intelligence Applications on Wall Street, IEEE Computer Society Press, 1991.
[11] S.M. Weiss and C.A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems, Morgan Kaufmann Publishers, San Mateo, 1991.
[12] A. Krogh and J. Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning", Advances in Neural Information Processing Systems 7, MIT Press, Cambridge MA, 1995.
[13] C.M. Christensen, "Effects of Color on Aroma, Flavor and Texture Judgments of Foods", Journal of Food Science, May/June 1983.
[14] S.K.Y. Sung, A Study of Baking Curve, B.Sc.(Hons) Thesis, Macquarie University, Sydney, Australia, 1994.
[15] R.R. Leighton, The Aspirin/Migraines Neural Network Software: User Manual Release V6.0, Technical Report MP–91W00050, The MITRE Corporation, 1992.
[16] B. Ostle, Statistics in Research: Basic Concepts and Techniques for Research Workers, The Iowa State University Press, second edition, 1963.
[17] D.C. Montgomery, Design and Analysis of Experiments, John Wiley & Sons, second edition, 1984, pp. 1533–1564.
