Artificial neural networks in multivariate calibration

Tormod Næs, Knut Kvaal, Tomas Isaksson and Charles Miller^a
MATFORSK—Norwegian Food Research Institute, Osloveien 1, N–1430 Ås, Norway.
^a Present address: Du Pont Polymers, Industrial Polymers Research, PO Box 1089, Orange, TX 77631-1089, USA.
This paper is about the use of artificial neural networks for multivariate calibration. We discuss network architecture and estimation as well as the relationship between neural networks and related linear and non-linear techniques. A feed-forward network is tested on two applications of near infrared spectroscopy, both of which have been treated previously and which have indicated non-linear features. In both cases, the network gives more precise prediction results than the linear calibration method of PCR.

Keywords: Artificial neural network, back-propagation, feed-forward network, PCR, PLS, near infrared spectroscopy, multivariate calibration.
Introduction
This paper addresses the use of artificial neural networks (ANN) in multivariate calibration, with special emphasis on their performance in near infrared (NIR) spectroscopy. In particular, we will focus on so-called “feed-forward” networks that use “back-propagation” (BP) for estimation. This topic has been addressed by other researchers,1,2,3 but there are still many unanswered questions about, for instance, prediction properties in practical NIR applications. The first half of the paper is devoted to a discussion of the choice of network and learning (estimation) rule, and a comparison between neural networks and other statistical methods such as partial least squares (PLS), principal component regression (PCR) and non-linear PLS. This discussion will also serve as a background and explanation for the problems considered in the second part, which is an empirical investigation of the performance of ANN’s in NIR spectroscopy.
Previous comparisons between the ANN method and other techniques have given quite different results with respect to prediction ability in NIR analysis. For example, in Long et al.1 the ANN method was compared to PCR and the latter gave the best prediction results, while in Borggaard and Thodberg2 a special application of the ANN method gave much better results than both PCR and PLS. The main goal of this paper is to provide more information about the prediction ability of the ANN method in practice, and also to compare it with the performance of a standard linear technique, namely PCR. In this way we want to contribute to the discussion about the performance of non-linear calibration techniques in NIR spectroscopy. We will also investigate how ANN’s should be used in order to give as good prediction results as possible. For more theoretical discussions regarding approximation properties of ANN models and statistical convergence properties of estimation algorithms, we refer to References 4 and 5. For a general background discussion of the ANN method, we refer to References 6, 7 and 8.
Figure 1. An illustration of a simple feed-forward network with one hidden layer and one output node.

Figure 2. An illustration of the signal processing in a sigmoid function.
Choice of network

Feed-forward networks
The field of neural networks covers a wide range of different network methods which are developed for and applied to very different situations. In particular, the “feed-forward” network structure is suitable for handling non-linear relationships between “input” and “output” variables when the focus is prediction.6 Reference 6 also discusses the differences between feed-forward and other networks. In this work we will confine ourselves to discussing some of the main features of the feed-forward network and how it can be fitted to data.

A feed-forward network is a function in which the information from the input data goes via intermediate variables to the output data. The input data (X) is frequently called the input layer and the output data (Y) is referred to as the output layer. Between these two layers are the hidden variables, which are collected in one or more hidden layers. The elements or nodes in the hidden layers can be thought of as sets of intermediate variables analogous to the latent variables in bilinear regression (e.g. PLS and PCR). An illustration of a feed-forward network with one output variable and one hidden layer is given in Figure 1. The information from all input variables goes to each of the nodes in the hidden layer, and all hidden nodes are connected to the single variable in the output layer. The contributions from all nodes or elements
are multiplied by constants and added before a possible transformation takes place within the node. The transformation is in practice often a sigmoid function, but can in principle be any function. The sigmoid signal processing in a node is illustrated in Figure 2. The feed-forward neural network in Figure 1 gives a regression equation of the form

y = f\Big( \sum_{i=1}^{I} b_i\, g_i\Big( \sum_{j=1}^{J} w_{ij} x_j + a_{i1} \Big) + a_2 \Big) + e     (1)
where y is the output variable, the x’s are the input variables, e is a random error term, g_i and f are functions, and b_i, w_ij, a_i1 and a_2 are constants to be determined. The constants w_ij are the weights that each input element is multiplied by before the contributions are added in node i of the hidden layer. In this node, the sum over j of all elements w_ij x_j is used as input to the function g_i. Each function g_i is then multiplied by a constant b_i before summation over i, and finally the sum over i is used as input to the function f. More than one hidden layer can also be used, resulting in a similar, but more complicated, function. Note that for both the hidden and the output layer there are constants, a_i1 and a_2 respectively, that are added to the contribution from the rest of the variables before the transformations take place. These constants play the same role as the intercept (constant) term in a linear regression model. As can be seen from Equation 1, an artificial feed-forward neural network is simply a non-linear parametric model for the relationship between y and all the x-variables. There are functions g_i and f that have to be selected6 and parameters w_ij and b_i that
must be estimated from the data. The process by which these parameters are determined is, in the terminology of artificial neural computing, called “learning”. The best choice of g_i and f can in practice be found by trial and error, in that several options are tested and the functions that result in the best prediction ability are selected. It has been shown that the class of models in Equation 1 is dense (for suitable choices of g_i and f) in the class of smooth continuous functions. This means that any continuous function relating y and x_1, …, x_J can be approximated arbitrarily well by a function of this kind.4 As with linear calibration methods, network models must be constructed with consideration of two important effects: underfitting and overfitting.9 If a model that is too simple or too rigid is selected, underfitting is the result, and if a model that is too complex is used, overfitting can be the consequence. The optimal model complexity is usually somewhere between these two extremes. Techniques have been developed that simultaneously estimate the complexity (which is related to the number of hidden nodes) and the model parameters,10 but this subject is not covered here.
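To make Equation 1 concrete, the following is a minimal sketch (our own illustration, not the program used later in the paper) of the prediction function for a network with one hidden layer, written in Python/NumPy; the variable names are hypothetical and a sigmoid is assumed for both g_i and f.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ann_predict(x, W, a1, b, a2):
    """Prediction from Equation 1 for a single input vector x.

    x  : (J,)   input variables (e.g. absorbances at J wavelengths)
    W  : (I, J) weights w_ij from input j to hidden node i
    a1 : (I,)   hidden-layer constants a_i1
    b  : (I,)   weights b_i from hidden node i to the output
    a2 : scalar output-layer constant a_2
    """
    hidden = sigmoid(W @ x + a1)       # g_i( sum_j w_ij x_j + a_i1 )
    return sigmoid(b @ hidden + a2)    # f( sum_i b_i g_i(...) + a_2 )
```

Replacing both sigmoids by the identity function collapses the model to the linear special cases discussed below.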
The back-propagation learning rule
The learning rule that is most frequently used for feed-forward networks is the back-propagation technique.6 In back-propagation, the objects in the calibration set are presented to the network one by one in random order, and the weights w_ij and b_i (regression coefficients) are updated each time in order to make the current prediction error as small as possible. This process continues until convergence of the regression coefficients. Typically, at least 10,000 runs or training cycles are necessary in order to reach a reasonable convergence. In some cases many more iterations are needed, but this depends on the complexity of the underlying function and the choice of input training parameters.

More specifically, back-propagation is based on the following strategy: (1) subtract the output of the current observation from the output of the network, (2) compute the squared residual (or another function of it) and (3) update the weights of the output layer in order to make the squared output error as small as possible. The error from the output layer is then “back-propagated” to the hidden layer and the same procedure is repeated for the w_ij’s. The differences between the updated and the previous parameter values depend on so-called learning constants, which can have a strong influence on the speed of convergence of the network. Such learning constants are usually dynamic, which means that they decrease as the number of iterations increases (see the examples below). The constants can be optimised for each particular application. In some cases, an extra term (momentum) is added to the standard back-propagation formulae in order to improve the efficiency of the algorithm. We refer to Reference 8 for details.

Although this kind of back-propagation seeks to minimise a least squares criterion, it is important to note that this kind of learning is different from standard least squares (LS) fitting, where the sum of squared residuals Σ(y − ŷ)² is minimised over all parameter values b_i and w_ij. Theoretical studies of the convergence properties of ANN’s can be found in Reference 10. Extensions of the back-propagation technique described above have been developed in order to make it more similar to the standard LS fitting of all objects simultaneously.8
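A simplified sketch of the sample-by-sample updating described above is given below; it is our own illustration, not the NeuralWorks implementation used in the examples. The learning-constant and momentum schedule mimics Table 1, and the gradients are those of the squared error for the one-hidden-layer sigmoid network of Equation 1. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def schedule(it):
    # Decreasing (learning constant, momentum) in the spirit of Table 1
    if it < 2_000:    return 0.9, 0.6
    if it < 20_000:   return 0.5, 0.3
    if it < 40_000:   return 0.25, 0.15
    return 0.1, 0.1

def train_backprop(X, y, n_hidden, n_iter=50_000, seed=0):
    """Sample-by-sample back-propagation for the network in Equation 1.

    X and y are assumed to be already scaled to the working range of the
    sigmoid (cf. the scaling described in the Examples section).
    """
    rng = np.random.default_rng(seed)
    n, J = X.shape
    W  = rng.normal(scale=0.1, size=(n_hidden, J))    # w_ij
    a1 = np.zeros(n_hidden)                           # a_i1
    b  = rng.normal(scale=0.1, size=n_hidden)         # b_i
    a2 = 0.0                                          # a_2
    dW = np.zeros_like(W); da1 = np.zeros_like(a1)    # momentum terms
    db = np.zeros_like(b); da2 = 0.0
    for it in range(n_iter):
        eta, mom = schedule(it)
        k = rng.integers(n)                           # present one object at random
        x, t = X[k], y[k]
        h = sigmoid(W @ x + a1)                       # hidden-layer outputs
        out = sigmoid(b @ h + a2)                     # network output
        err = out - t                                 # current prediction error
        d_out = err * out * (1.0 - out)               # gradient at the output node
        d_hid = d_out * b * h * (1.0 - h)             # error back-propagated to the hidden layer
        db  = -eta * d_out * h          + mom * db
        da2 = -eta * d_out              + mom * da2
        dW  = -eta * np.outer(d_hid, x) + mom * dW
        da1 = -eta * d_hid              + mom * da1
        b += db; a2 += da2; W += dW; a1 += da1
    return W, a1, b, a2
```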
A modification of the standard network
A modification of the standard model described in Equation 1 has been suggested in Reference 2. The idea of this modification is to add direct linear connections between the input and the output layer. This leads to the network model

y = f\Big( \sum_{i=1}^{I} b_i\, g_i\Big( \sum_{j=1}^{J} w_{ij} x_j + a_{i1} \Big) + a_2 \Big) + \sum_{j=1}^{J} d_j x_j + a_3 + e     (2)
The d’s here are the constants that are multiplied by the input variables in the direct connections between the input and output layer and the a’s are constant terms. In this way, the network is composed of a linear regression part and a standard non-linear network part. As for the standard network, back-propagation is used for learning. In Reference 2 this modified network performed well on near infrared data.
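Using the same hypothetical names as in the sketch given earlier, the modified model in Equation 2 only adds a direct linear term to the standard network output (again our illustration, not code from Reference 2):

```python
def modified_predict(x, W, a1, b, a2, d, a3):
    """Equation 2: the standard network plus direct linear input-output connections."""
    hidden = sigmoid(W @ x + a1)
    return sigmoid(b @ hidden + a2) + d @ x + a3   # d_j are the direct-connection weights
```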
Relations to other methods
The artificial neural network models have strong relations to other statistical methods that have been used frequently in the past. In this section, however, we will only focus on a limited set of relations to methods that are frequently discussed and used in the chemometric and NIR literature. For a discussion of more statistically oriented aspects of such relations, we refer to References 5 and 10.
Multiple linear regression (MLR)
MLR is based on the following model for the relationship between y and x:

y = a + \sum_{i=1}^{I} b_i x_i + e     (3)
where a and the b_i’s are regression coefficients to be estimated. We see that this model is identical to a special version of Equation 1 in which the hidden layer is removed and the transfer function is set equal to the identity function. It is also easy to see that if we replace both f and g_i in Equation 1 by identity functions, we end up with a model which is essentially the same as the MLR model; this is seen by multiplying the w_ij and b_i values and renaming the products. This procedure is called reparameterisation of the network function. However, in the learning procedure, the MLR equation is most frequently fitted by using standard LS instead of applying, for example, the iterative fitting procedure described above. Below we will investigate differences between the empirical properties of these two estimation techniques.
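The reparameterisation argument can be checked numerically; this small sketch (ours, with arbitrary numbers) shows that a purely linear two-layer network collapses to a single set of MLR coefficients (offsets omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 10))        # w_ij of a linear "hidden layer"
b = rng.normal(size=4)              # b_i
x = rng.normal(size=10)

two_layer = b @ (W @ x)             # linear network output
beta = b @ W                        # reparameterised MLR coefficients, beta_j = sum_i b_i w_ij
print(np.allclose(two_layer, beta @ x))   # True
```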
Partial least squares regression and principal component regression
The regression equation for both PLS and PCR (with centred y and x) can be written as

y = \sum_{i=1}^{I} b_i \Big( \sum_{j=1}^{J} w_{ij} x_j \Big) + e = \sum_{i=1}^{I} b_i t_i + e     (4)

where the w_ij’s now correspond to the loading weights9 of factor i and the b_i’s correspond to the regression coefficients of y on the latent variables t_i = Σ_j w_ij x_j (i.e. the y-loadings). It can be seen that, apart from the non-linear g_i and f in the feed-forward network Equation 1, the two equations (1) and (4) are identical. This means that the model equation for PLS and PCR is a special case of an ANN equation with linear g_i and f. However, for the learning procedure, PCR and PLS are quite different from ANN. The idea behind the PCR/PLS methods is that the weights w_ij are determined with a strong restriction on them in order to give as relevant scores for prediction as possible.9 In PLS, the w_ij’s are the PLS loading weights found by maximising the covariance between y and linear functions of x, and in PCR they are the principal component loadings obtained by maximising the variance of linear functions of x. In other words, estimation of b_i and w_ij is not done by fitting Equation 4 as well as possible, but through the use of quite a different rule. If both b_i and w_ij in Equation 4 were determined by LS directly, the ordinary MLR method would be the result, since Equation 4 is only a reparameterisation of the linear regression equation.

Since the ANN method concentrates on fitting without restrictions (on the parameters), there are reasons to believe that ANN’s are more sensitive to overfitting than are PLS and PCR. As a compromise, it has been proposed to use principal components as input for the network instead of the original variables themselves.2 This procedure may, however, have some drawbacks, since it restricts the network input to only a few linear combinations of the original spectral values. This problem is studied in the application section below.

In PCR and PLS models, the scores t_i and loading weights w_ij are orthogonal (for each j). This is not the case for ANN models, in which the w_ij’s are determined without this restriction. The orthogonality properties of PLS and PCR are strong reasons for their usefulness in the interpretation of data, and this same argument can be used against the present versions of ANN.
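A sketch of the compromise mentioned above (principal components as network input), with hypothetical function and variable names: the centred calibration spectra are decomposed by singular value decomposition and the first few score vectors are used as the network’s input variables.

```python
import numpy as np

def pca_scores(X_cal, X_new, n_comp):
    """Project calibration and new spectra onto the first n_comp principal components."""
    x_mean = X_cal.mean(axis=0)
    Xc = X_cal - x_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp].T                      # loadings, shape (J, n_comp)
    return Xc @ P, (X_new - x_mean) @ P    # scores used as network inputs
```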
Projection pursuit (PP) regression
The PP regression method11 is based on the equation
y = \sum_{i=1}^{I} s_i\Big( \sum_{j=1}^{J} w_{ij} x_j \Big) + e     (5)
for the relation between y and the x-variables. In this case, the s_i’s are “smooth” functions of the linear combinations t_i = Σ_j w_ij x_j. Again, we see that the model in Equation 5 is similar to an ANN model with linear f and with the g_i replaced by s_i. Notice also that since the s_i’s have no underlying functional assumption, PP regression is based on an even more flexible function than the ANN in Equation 1. Note also that since the s_i’s are totally general except that they are smooth, we no longer need the constants a_i1 and a_2.

Regarding the estimation of the weights w_ij and functions s_i, PP regression is based on a least squares strategy. First, the constants w_ij and the smooth function s_i are determined for the first factor, then the effect of this factor is subtracted and the same procedure is applied to the residuals. This continues until the residuals have no systematic effects left. For each factor, the s_i and w_ij are found by an iterative procedure. The s functions are usually determined using moving averages or local linear fits to the data.

The PP regression method as described here is more flexible than the class of ANN functions, and the optimal functions s_i and constants w_ij are determined simultaneously. This procedure is very different from the ANN procedure, in which different functions g and f have to be tried independently. However, one main drawback of PP regression is that it gives no closed-form solution for the prediction equation, only smooth fits to samples in the calibration set. Therefore, prediction of y-values for new samples must be based on linear or other interpolations between the calibration points. The relationship between the PP and ANN methods is treated in more detail in Maechler et al.5
Non-linear PLS regression
In 1990, Frank12 published a method called non-linear PLS, which is a kind of hybrid of PP regression and PLS. The technique is essentially based on the same model as PP regression, but the constants w_ij are determined in the same way as the PLS loading weights, from residual matrices after all previous factors have been subtracted. Thus, this version of non-linear PLS is based on a similar, but more flexible, model than the ANN model, but the estimation of coefficients is carried out under a strong restriction, namely the PLS criterion based on maximisation of the covariance between y and linear functions of the x-variables. The non-linear PLS method proposed in Wold13 is also essentially based on a model of the same kind as PP regression. In this case, however, the smooth functions are approximated by splines, and the estimation of parameters follows a quite different rule from the procedure presented in Frank.12
Examples
In this section we will test the feed-forward ANN described by Equation 1 in two real situations in which it is desired to relate NIR spectral data (X) to a quality parameter (Y). Both data sets have been used elsewhere and the wavelengths (X-data) are strongly collinear.14 In this study, the main emphasis will be on the prediction ability of the ANN method, considered as a function of the number of nodes in the hidden layer, the number of input variables (or principal components) and the number of iterations. All prediction results below will be presented in terms of the root mean square error of prediction (RMSEP), which is defined by

\mathrm{RMSEP} = \Big( \sum_{i=1}^{I_p} ( y_i - \hat{y}_i )^2 / I_p \Big)^{1/2}     (6)
where I_p is the number of samples in the prediction set. All reported results are obtained by testing the networks on samples that were not used for training of the network. All networks in this paper are computed using NeuralWorks Professional II/plus (NeuralWare Inc., PA, USA). During network training, all input and output values are scaled so that they are in the range 0–1 (for input) and 0.2–0.8 (for output), respectively (for the rationale, see the manual of the program). After convergence of the network, all prediction results are transformed back to the original scale. The learning constants and momentum values that are used throughout this work are given in Table 1.
Table 1. The learning constants of the neural network. For both the output and hidden layers we used the same learning constant and momentum value; the constants change as the number of iterations increases.

Number of iterations    0–2000    2000–20,000    20,000–40,000    40,000–100,000
Learning constant        0.9         0.5             0.25              0.1
Momentum                 0.6         0.3             0.15              0.1
It is important to note that these constants are changed as the number of training cycles increases. Such a procedure is recommended (see the manual of the program) and is natural because it enables a sort of “fine-tuning” of the network as it approaches convergence. All non-linear networks used are based on a sigmoid transfer function.
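A small sketch (ours, not the program’s code) of the range scaling just described, with the ranges taken from the training data, together with the RMSEP of Equation 6 computed after back-transforming the predictions:

```python
import numpy as np

def scale_range(train, new, lo, hi):
    """Linearly map values so that the training-data range becomes [lo, hi]."""
    t_min, t_max = train.min(axis=0), train.max(axis=0)
    f = lambda v: lo + (hi - lo) * (v - t_min) / (t_max - t_min)
    return f(train), f(new)

def unscale_y(y_scaled, y_train, lo=0.2, hi=0.8):
    """Back-transform scaled network outputs to the original y scale."""
    return y_train.min() + (y_scaled - lo) * (y_train.max() - y_train.min()) / (hi - lo)

def rmsep(y_true, y_pred):
    """Root mean square error of prediction, Equation 6."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```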
Water predictions in meat
The data set used in this study contains the near infrared transmission (NIT) spectra of 103 different meat samples at 50 different wavelengths (X-variables) and the concentration of water in these samples as determined by oven-drying (Y-variable). Seventy samples are used for calibration and the remaining 33 samples are used for testing the performance of the calibration model. The data set is described in more detail in Reference 15. Regarding the X-data, both original NIT spectra and NIT spectra that are corrected for scattering effects (by the MSC method)16 will be considered.

Prediction ability
The prediction results using the PCR method are given in Figure 3. The best result is RMSEP = 0.77, obtained by using eight principal components from scatter-corrected data. The RMSEP obtained for the PCR model that uses the maximum number of factors from the scatter-corrected data is equal to 1.65. In other words, PCR gave much better results than a full MLR based on all components. It can also be seen that scatter-corrected data gave clearly better results than uncorrected data. The prediction results obtained using non-linear ANN models with different numbers of hidden nodes are given in Table 2 for both uncorrected and scatter-corrected data.
Figure 3. PCR prediction error as a function of the number of components (meat data). The upper curve corresponds to uncorrected data and the lower curve corresponds to scatter-corrected data.

Table 2. RMSEP results for a sigmoid ANN with the use of the 50 individual x-values as input (meat data). Both uncorrected and scatter-corrected data are used. All results are based on 50,000 iterations.

                         Number of nodes in hidden layer
                           1       10      25      50
Scatter-corrected         0.82    0.70    0.74    0.81
Uncorrected               3.95    2.28    2.26    2.27
As will be discussed below, the use of 50,000 iterations was close to optimal for this data set, and all results reported in this section are therefore based on this number. As can be seen, the scatter-corrected data gave much better results than the uncorrected data, indicating the advantage of input data prelinearisation, even for the non-linear ANN technique. The effect of using different numbers of nodes in the hidden layer is weak. For the scatter-corrected data the best results were
obtained using 10 and 25 nodes, and for the uncorrected data the best results were obtained using 10 nodes, indicating that an intermediate number of hidden nodes is optimal. The best network gave about 9% better results than PCR.

Results obtained by using a sigmoid ANN method based on principal components instead of the original x-variables are presented in Table 3. A “smooth” graphical illustration of the results is presented in Figure 4. The table presents results obtained using different numbers of principal components as input and different numbers of nodes in the hidden layer. Scatter-corrected (MSC) data gave much better results than uncorrected data and therefore only the results using scatter-correction are reported.

Table 3. RMSEP results for sigmoid ANN’s computed on scatter-corrected meat data, using different numbers of principal components as input and different numbers of nodes in the hidden layer. 50,000 iterations were used.

                              Number of nodes in hidden layer
Number of input variables       1       3       8       15
 3                             0.84    0.79    0.74    0.92
 8                             0.72    0.64    0.68    0.66
15                             1.05    1.03    1.01    1.06
22                             1.32    1.31    1.16    1.37

Figure 4. A “smooth” illustration of the numbers in Table 3. The smoothing technique used is the default method of PROC G3GRID in SAS (SAS Institute, NC, USA).

It is clear that it is best to use an intermediate number of PC’s as input, namely eight, and also an intermediate number of nodes in the hidden layer, namely three (RMSEP = 0.64). In other words, there is an effect of underfitting and an effect of overfitting both with respect to the number of nodes and the number of input components (variables). It is also worth noting that the results obtained using principal components as input are better than those obtained using individual x-responses as input, although the differences are small. The best ANN result was about 15% better than the best PCR result.

For the scatter-corrected data, a linear transfer function network based on all input variables and with only one hidden node was also tested; the RMSEP was equal to 0.89, which is somewhat larger than the value obtained for the best sigmoid network. For comparison, the linear ANN based on all available principal components gave a RMSEP of 1.66 after 50,000 iterations. (This result is, as expected, similar to the linear PCR based on all available components.) This difference in performance between the linear ANN method based on all original variables and the linear ANN method based on all principal components is somewhat surprising, since the two methods are based on essentially the same model. The most plausible explanation for the different performances must therefore be that the iterative ANN learning rule, as opposed to standard least squares fitting, can be dependent on linear transformations of the data. Our interpretation of this result is that when original variables are presented to the network, back-propagation (at least with the learning constants in Table 1) is not able to detect the directions of minor variability and relevance in the X-space. When principal components are used, however, these directions are sorted out and “blown up” by the scaling, and therefore are given the same possibility as the major variabilities to influence the determination of regression coefficients.
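The numbers in Table 3 amount to a small grid search over the number of principal components and the number of hidden nodes, judged by RMSEP on the test set. Purely as an illustration of that selection loop (not the software used in this paper), a modern stand-in such as scikit-learn could be scripted along these lines; X_cal, y_cal, X_test and y_test are assumed to hold the calibration and test data, and MLPRegressor is trained by its own optimiser rather than the back-propagation schedule of Table 1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

def rmsep(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

results = {}
for n_comp in (3, 8, 15, 22):
    pca = PCA(n_components=n_comp).fit(X_cal)
    T_cal, T_test = pca.transform(X_cal), pca.transform(X_test)
    for n_hidden in (1, 3, 8, 15):
        net = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation='logistic',
                           max_iter=5000, random_state=0).fit(T_cal, y_cal)
        results[(n_comp, n_hidden)] = rmsep(y_test, net.predict(T_test))
```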
Linear ANN models with three, eight and 15 principal components as input were also tested and these gave exactly the same results as the linear PCR solutions. This result is expected because an ANN model with a linear transfer function in the principal components is essentially the same model as the linear PCR method. In Næs and Isaksson15 the same data were used for evaluating the locally weighted regression (LWR) method (in that paper, however, 100 wavelengths were used). The LWR method based on three principal components (after scatter-correction) gave an RMSEP of 0.65, which is almost equal to the prediction ability obtained from the best ANN model. The ANN, however, needed more principal components, namely eight, to obtain a comparable result.

The rate of convergence
From testing of networks (on the prediction set) for different numbers of iterations, it was found that about 50,000 iterations gave the best results for this data set. The results from four different networks are presented in Figure 5. This figure also indicates that already after 30,000 iterations we are close to the optimal values. From 50,000 to 100,000 iterations there is a small increase in the RMSEP. This is interpreted as a slight overfitting of the network model. The results in Table 2 and Table 3 are presented for 50,000 iterations.

Figure 5. Three replicates of four different networks computed on the meat data. For each network, the three replicates are illustrated by the same symbol. For each network, only a limited number of RMSEP’s were recorded; in each case they are joined by linear interpolation. Overlapping curves are represented by just one line.

The influence of starting values on the RMSEP
To test the dependence of the fitted neural networks on the starting values of the network parameters, we recomputed different networks using different starting values. In Figure 5, the prediction results (on the test data) of three replicated computations (from 5000 to 100,000 iterations) for four quite different networks are given. It can be seen that the differences between the replicates are quite small and the relative differences between the four networks are, in this case, independent of starting values. However, the differences between replicates are large enough to indicate that the smaller differences in Table 3 should not be interpreted too strongly.

Polymer data
The data used in this experiment contains the near infrared diffuse reflectance (NIR) spectra of 90 different polyurethane elastomers at 138 wavelengths (X), and the flex modulus (or “stiffness”) value for these samples (Y). The flex modulus values of these samples ranged from 5.8 to 68.9 PSI. A training set of 47 samples and a prediction set of 43 samples was chosen such that the Y-values of the prediction samples were within the range of the values for the training set samples. In this example only the scatter-corrected data (by the MSC method) are used. Details regarding this data set are presented in Reference 17. It should be noted that linear PLS analysis of these data revealed a strong non-linear relationship between the X and Y data when three PLS factors were used.

Prediction ability
The PCR prediction results are presented in Figure 6. The best prediction result was in this case RMSEP = 2.84, obtained using 38 components (note that 21 components gave only slightly less precise predictions). The RMSEP was, however, quite stable with respect to the number of components, but increased as a result of overfitting as the number of components became very large. The use of the maximum number of components (46 = 47 − 1, because of mean-centring) resulted in a RMSEP of 3.86.
Figure 6. PCR prediction error as a function of the number of components (polymer data).

Non-linear ANN prediction results for networks that use all 138 individual wavelengths as input are given in Table 4, for different numbers of nodes in the hidden layer. As in the previous data set, an intermediate number of nodes in the hidden layer, namely 25, gave the best result (RMSEP = 1.96). As can be seen, the improvement over the best PCR result is about 30%. It is important to note that the use of all available wavelengths in a traditional least squares approach would be impossible in this case, because the number of samples is much smaller than the number of variables. However, for the ANN method with back-propagation learning, this situation posed no problem.

Table 4. RMSEP’s for polymer data in which all 138 wavelengths are used as input (scatter-corrected data). 100,000 iterations were used.

Number of nodes in hidden layer      1       25      138
RMSEP                               2.58    1.96    2.13

The RMSEP’s obtained by using principal components as input variables for the ANN are given in Table 5. As for the previous data set, there is an effect of overfitting with respect to the number of hidden nodes. It is also clear that the use of 13, 16 and 21 components as input resulted in similar and optimal prediction results. Note also the overfitting effect as the number of components is increased to the maximum.

Table 5. RMSEP results for sigmoid ANN’s computed on scatter-corrected polymer data, using different numbers of principal components as input and different numbers of nodes in the hidden layer. 100,000 iterations were used except for the computations in the last line, where only 1600 iterations were used (see text).

                              Number of nodes in hidden layer
Number of input variables       1       3       8       13
 3                             3.05    2.98    3.61    3.22
 5                             3.16    3.13    3.86    4.53
13                             2.65    2.64    3.00    3.11
16                             2.76    2.63    2.85    2.96
21                             2.57    2.51    2.64    2.80
46                             3.11    3.27    3.38    3.79

Figure 7. Three replicates of four different networks computed on the polymer data. For each network, the three replicates are illustrated by the same symbol. For each network, only a limited number of RMSEP’s were recorded; in each case they are joined by linear interpolation. Overlapping curves are represented by just one line.

The influence of starting values on the RMSEP
As in the previous data set, prediction results of three replicates of four different network models are shown in Figure 7. There are some differences between the prediction errors obtained from the replicate
models, but these are not large enough to justify a change in the main conclusions regarding the differences among the four networks.

The rate of convergence
For this data set, 100,000 iterations resulted in slightly better prediction results (on the test data) than 50,000 iterations, but the differences are very small. The only exception was the network that used 46 principal components, for which about 1600 iterations resulted in the best model. In other words, a strong effect of overfitting as the number of iterations increased was observed in this case. All results in Tables 4 and 5 are given for 100,000 iterations, except those obtained from the network with 46 principal components as input.
Discussion
There are a number of conclusions that can be drawn from the applications above, both of which were selected from earlier publications in which non-linear features had been indicated. First of all, the sigmoid ANN method gave clear improvements in prediction ability compared to the linear PCR method (15% and 30%). In addition, scatter-corrected data gave better results than the uncorrected NIR data both for PCR and for the non-linear ANN. This result shows that even the very flexible ANN functions are not always able to cope with the strong non-linear scattering effects in NIR spectra. Next, in all but one case there was an effect of overfitting with respect to the number of nodes used in the hidden layer. This means that the use of a function that is too flexible results in less precise networks. In both examples, the use of principal components as input data gave good results, but the RMSEP varied quite considerably with the number of components used. In both examples, there was a clear effect of overfitting with respect to the number of principal components used, as was also the case for the linear PCR. For one of the data sets, the use of all individual x-values as input gave better prediction results than the optimal number of principal components. In the other case the difference in prediction ability between networks based on principal components and networks based on the original
spectral variables was very small. This indicates that the overfitting effect is not necessarily as strong as expected when the original variables are used. When the principal components are used as input, however, the overfitting effect was strong. More research is needed to understand this fully. Networks with sigmoid functions gave, in general, better predictions than networks with linear transfer functions. The number of iterations that gave the best result was different in the two cases: in the first example 50,000 was close to the optimum, and in the other example 100,000 was. The differences between the two choices were, however, small in both cases. It should also be mentioned that the rate of convergence is dependent on the learning parameters. Parameters other than those used in the present investigation could possibly give somewhat different results, but this was not investigated. The differences between replicated computations of the same network model (with different starting parameters) were quite small in all the cases considered.

The main conclusion is that the ANN method, if used in the right way, was a good method for calibration in the two cases tested. The improvements compared to the linear PCR were, however, much more moderate than in the examples in Borggaard and Thodberg.2 This shows that the performance of ANN methods relative to linear methods is strongly dependent on the situation considered. This conclusion is supported if we also compare the above results with the results in Long et al.1 A last point to mention is that even though artificial neural networks may in certain cases be better than, for example, PCR from a prediction point of view, they are more difficult to understand and the results are much more difficult to visualise and interpret. These aspects may, however, be improved in future developments of ANN techniques.
References
1. R.L. Long, V.G. Gregoriou and P.J. Gemperline, Anal. Chem. 62, 1791 (1990).
2. C. Borggaard and H.H. Thodberg, Anal. Chem. 64, 545 (1992).
3. W.F. McClure, H. Maha and S. Junichi, in Making Light Work: Advances in Near Infrared Spectroscopy (Ed. by I. Murray and I. Cowe). VCH, Weinheim, pp. 200–209 (1992).
4. A.R. Barron and R.L. Barron, in Computing Science and Statistics: Proceedings of the 21st Symposium on the Interface. American Statistical Association, Alexandria, VA, pp. 192–203 (1988).
5. M. Maechler, D. Martin, J. Schimert, M. Csoppenszky and J.N. Hwang, in Proceedings of the 2nd International Conference on Tools for AI (1990).
6. Y.H. Pao, Adaptive Pattern Recognition and Neural Networks. Addison-Wesley (1989).
7. J. Zupan and J. Gasteiger, Anal. Chim. Acta 248, 1 (1991).
8. J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, USA (1991).
9. H. Martens and T. Næs, Multivariate Calibration. John Wiley & Sons, Chichester (1989).
10. A.R. Barron, in Nonparametric Functional Estimation (Ed. by G. Roussas). Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 1034–1053 (1991).
11. J.H. Friedman and W. Stuetzle, J. Am. Stat. Assoc. 76, 817 (1981).
12. I. Frank, Chemometrics and Intelligent Laboratory Systems 8, 109 (1990).
13. S. Wold, Chemometrics and Intelligent Laboratory Systems 14, 71 (1992).
14. S. Weisberg, Applied Linear Regression. John Wiley & Sons, New York (1985).
15. T. Næs and T. Isaksson, Appl. Spectrosc. 46, 34 (1992).
16. P. Geladi, D. McDougal and H. Martens, Appl. Spectrosc. 39, 491 (1985).
17. C.E. Miller and B.E. Eichinger, Appl. Pol. Sci. 42, 2169 (1991).
Paper received: 23 October 1992 Accepted: 5 February 1993