JOURNAL OF CHEMOMETRICS J. Chemometrics 2004; 18: 112–120 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cem.858
Monte Carlo cross-validation for selecting a model and estimating the prediction error in multivariate calibration

Qing-Song Xu¹, Yi-Zeng Liang²* and Yi-Ping Du³

¹College of Mathematics and Econometrics, Hunan University, Changsha, People's Republic of China
²College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, People's Republic of China
³Institute of Chemometrics and Chemical Sensing Technology, Hunan University, Changsha, People's Republic of China

*Correspondence to: Y.-Z. Liang, College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, People's Republic of China. E-mail: [email protected]
Received 1 May 2003; Revised 29 March 2004; Accepted 29 March 2004
A new, simple and effective method named Monte Carlo cross-validation (MCCV) has been introduced and evaluated for selecting a model and estimating the prediction ability of the model selected. Unlike the leave-one-out procedure widely used in chemometrics for cross-validation (CV), the Monte Carlo cross-validation developed in this paper is an asymptotically consistent method of model selection. It can avoid an unnecessarily large model and therefore decreases the risk of overfitting. The results obtained from a simulation study showed that MCCV has an obviously larger probability than leave-one-out CV (LOO-CV) of selecting the model with the best prediction ability and that a corrected MCCV (CMCCV) could give a more accurate estimate of the prediction ability than LOO-CV or MCCV. The results obtained with real data sets demonstrated that MCCV could successfully select an appropriate model and that CMCCV could assess the prediction ability of the selected model with satisfactory accuracy. Copyright © 2004 John Wiley & Sons, Ltd.
KEYWORDS: model selection; prediction error; cross-validation
1. INTRODUCTION

One of the most useful methods for modeling and prediction problems in multiple regression analysis is cross-validation (CV) [1–4]. During the last decade the CV method has been used extensively in chemometrics. Examples include selection of a model (or variables) and assessment of the prediction ability of the model in multivariate calibration and quantitative structure–activity relationship (QSAR) research. The appealing characteristics of CV are that it selects an appropriate model according to its prediction performance and, at the same time, evaluates the prediction ability of the model for unknown samples without requiring any additional new samples for external validation. Because prediction is the major goal in multivariate calibration and QSAR problems, the CV method is among the most popular ones used by chemometricians. In the literature, CV generally refers to the simplest leave-one-out cross-validation (LOO-CV) unless stated otherwise.

However, there are some particular problems with LOO-CV. As shown by Efron [5], LOO-CV is a poor candidate for estimating the prediction error. Many other authors have pointed out that LOO-CV often causes overfitting and, on average, gives an underestimation of the true prediction error [6–8]; they have accordingly been careful when using CV and have made some improvements over the LOO-CV criterion [9,10]. The reason for this deficiency of LOO-CV is that it is not a consistent method. An asymptotically consistent method [11] selects the best prediction model with probability one as the sample size n → ∞. In this sense LOO-CV is inconsistent; that is, this probability is smaller than one. With a large sample size it can determine the variable subset belonging to the optimal model [12], but it also selects additional variables [11]. The consequence is that it tends to include unnecessary variables in the model, making the model larger than it should be. Therefore the model selected often performs well in calibration but poorly in prediction.

On the other hand, much attention has been paid to CV with more than one sample left out at a time for validation. Multifold CV was introduced by Geisser [13]. Leave-two-out CV sometimes performs better than LOO-CV [14]. Theoretical results on multifold CV can be found in References [15–17]. Monte Carlo cross-validation (MCCV) first appeared in a paper by Picard and Cook [18]. Shao [11] proved that the MCCV method is asymptotically consistent and pointed out that it has a larger probability than LOO-CV of selecting the model with the best prediction ability. In one of our previous papers [19], MCCV was used for PLS modeling and it was shown that the model selected by it obviously performed
better than the one selected by LOO-CV. The difference between MCCV and LOO-CV is that, instead of leaving out one sample at a time for validation, MCCV leaves out a major part of the samples at a time. This enhances the impact of validation on modeling and increases the probability of selecting the best model. Although model selection by MCCV can avoid an unnecessarily large model and thus decrease the risk of overfitting, MCCV does not, on average, work well when estimating the prediction error of the established model: it often overestimates the prediction error, because it uses only a minor part of the data for calibration. Therefore MCCV needs some modification if one wants a more accurate estimate of the prediction error for unknown samples.

In this study we first derive the prediction error of the established model; the MCCV method is then reviewed. In order to improve the accuracy of estimation of the prediction error of the established model, a correction term is added to MCCV. To demonstrate the effectiveness of the corrected MCCV (CMCCV) method, Monte Carlo simulation experiments are performed, consisting of modeling on a calibration data set and prediction on a test data set. Finally, three real examples are discussed in detail.
2. THEORY AND METHOD

2.1. Prediction ability of the model

Suppose we are given data with $n$ samples represented by $p$ potential variables $x_{i1}, x_{i2}, \ldots, x_{ip}$ and a response $y_i$ $(i = 1, 2, \ldots, n)$. The relationship between response and variables is supposed to be linear. There are some assumptions about the data: for instance, the samples are representative of the relationship and the random errors are homogeneous. Although more complicated situations may occur in chemical practice [8] (e.g. heteroscedastic errors), we base our studies on these assumptions, since they are quite basic for modeling. Thus the model can be stated as

$$y = Xb + e, \qquad E(e) = 0, \qquad \mathrm{Cov}(e) = \sigma^2 I \tag{1}$$
where $y = (y_1, y_2, \ldots, y_n)^t$ is the response vector (the superscript 't' denotes the transpose), $X = (x_{ij})$ $(i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, p)$ is an $n \times p$ matrix, $b$ is a $p$-dimensional vector of unknown parameters, $e$ is an $n \times 1$ random error vector, $I$ is the $n \times n$ identity matrix, and $E(\cdot)$ and $\mathrm{Cov}(\cdot)$ denote the expectation and covariance respectively. Because the true model is not available—that is, it is not known whether some of the elements of $b$ are zero—determining the model amounts to deciding which components of $b$ should be used in the model. In statistics this is equivalent to selecting variables for a regression model. Therefore a more parsimonious model may be true:

$$y = X_\alpha b_\alpha + e \tag{2}$$

where $\alpha$ is a subset of $\{1, 2, \ldots, p\}$, $X_\alpha$ denotes the matrix whose columns are those columns of $X$ indexed by the integers in $\alpha$, and $b_\alpha$ denotes the vector whose components are those components of $b$ indexed by the integers in $\alpha$. There are in total $2^p - 1$ possible different models of the form (2). Let $R$ denote the collection of all subsets of $\{1, 2, \ldots, p\}$; thus $\alpha \in R$ and the size of $R$ is $2^p - 1$.
For the model of form (1), if $\alpha$ is selected, then the model is fitted based on Equation (2):

$$\hat{b}_\alpha = (X_\alpha^t X_\alpha)^{-1} X_\alpha^t y \tag{3}$$

The quality of the established model is evaluated according to its prediction ability, since the major role of regression modeling is to predict unknown future samples [20]. Furthermore, unless we know the expected prediction ability of an estimated model, the estimated model can hardly be put into use in practice [21]. Theoretically, the mean squared error of prediction (MSEP) of a model represents its prediction ability: the lower the MSEP, the better the prediction ability. For the established model (3) the MSEP is given by [22,23]

$$\mathrm{MSEP}(\alpha) = \sigma^2 + \frac{1}{n} p_\alpha \sigma^2 + \Delta_\alpha \tag{4}$$

where $p_\alpha$ is the number of elements included in $b_\alpha$, $\Delta_\alpha = (1/n)\, b^t X^t (I - P_\alpha) X b$ and $P_\alpha = X_\alpha (X_\alpha^t X_\alpha)^{-1} X_\alpha^t$ is the projection matrix under model (2). Note that the MSEP consists of two parts: $\sigma^2$, the variance of the future response, and $(1/n) p_\alpha \sigma^2 + \Delta_\alpha$, which corresponds to the error and bias in model selection and estimation. When all non-zero elements of $b$ are included in $b_\alpha$, $Xb = X_\alpha b_\alpha$; hence $\Delta_\alpha = 0$ and this gives

$$\mathrm{MSEP}(\alpha) = \sigma^2 \left( 1 + \frac{p_\alpha}{n} \right) \tag{5}$$

It follows from Equations (4) and (5) that the optimal model, which has the lowest MSEP, is the one that not only contains all the non-zero elements of $b$ but is also the most parsimonious.

In multivariate calibration the common data are spectroscopic, characterized by many variables and relatively few observations ($p \gg n$). In this situation it is better to use latent variable modeling methods such as PLS [24]. For the PLS model the matrix $X$ can be decomposed as $X = t_1 p_1^t + t_2 p_2^t + \cdots + t_k p_k^t + R = T_k P_k^t + R$, where $t_i$ and $p_i$ are the PLS scores and loadings respectively and $R$ is the residual matrix. The value $k$ denotes the number of PLS components introduced into model (1). Each PLS score $t_i$ is a linear combination of the column vectors of the observation matrix $X$, i.e. [19,25]

$$T_k = X H_k \tag{6}$$

The PLS estimator $\hat{b}_k$ of $b$ with $k$ PLS components remaining in the model is

$$\hat{b}_k = H_k (T_k^t T_k)^{-1} H_k^t X^t y \tag{7}$$

Then the MSEP is again given by Equation (4), but now with $p_\alpha \sigma^2 = E[(\hat{b}_k - E(\hat{b}_k))^t X^t X (\hat{b}_k - E(\hat{b}_k))]$ and $\Delta_\alpha = (1/n) E[(E(\hat{b}_k) - b)^t X^t X (E(\hat{b}_k) - b)]$.
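As a quick numerical illustration of Equation (5) (ours, using the settings of the simulation described later in Section 3.1): with $n = 40$ training samples, noise variance $\sigma^2 = 10^{-2}$ and a true model containing $p_\alpha = 2$ non-zero coefficients (the intercept and the coefficient of $x_3$), selecting exactly these two terms gives

$$\mathrm{MSEP}(\alpha) = 10^{-2} \left( 1 + \frac{2}{40} \right) = 1.05 \times 10^{-2}$$

which agrees closely with the true-model MSEP of $1.06 \times 10^{-2}$ estimated on independent test samples in Table II. Every extra selected variable inflates the $p_\alpha \sigma^2 / n$ term, which is why the most parsimonious correct model is optimal.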
2.2. Model selection by Monte Carlo cross-validation

There are several methods, such as the Akaike information criterion [26] and the $C_p$ statistic [27], which can be used to deal with the selection problem. However, cross-validation [1–4] attracts the most attention, since it selects a model from the point of view of prediction. For general CV, when $\alpha$ is selected, the $n$ samples (denoted $S$) are split into two parts. The first part (calibration set), denoted $S_c$ (with corresponding submatrix $X_{S_c}$ and subvector $y_{S_c}$), contains $n_c$ samples for fitting the model. The second part (validation set), denoted $S_v$ (with corresponding submatrix $X_{S_v}$ and subvector $y_{S_v}$), contains $n_v = n - n_c$ samples for validating the model. There are in total $\binom{n}{n_v}$ different forms of sample split. For each sample split the model is fitted to the $n_c$ samples of the first part $S_c$ (Equation (3)) to obtain $\hat{b}_{S_c}$. The samples in the validation set are treated as if they were future ones. The fitted model then predicts the response vector $y_{S_v}$:

$$\hat{y}_{S_v} = X_{S_v} \hat{b}_{S_c} \tag{8}$$

The average squared prediction error (ASPE) over all samples in the validation set is

$$\mathrm{ASPE}(S_v, \alpha) = \frac{1}{n_v} \left\| y_{S_v} - \hat{y}_{S_v} \right\|^2 \tag{9}$$
where $\|\cdot\|$ stands for the Euclidean norm of a vector. Let $S^*$ be the set whose elements are all the validation sets corresponding to the $\binom{n}{n_v}$ different forms of sample split. The cross-validation criterion with $n_v$ samples left out for validation is defined as

$$CV_{n_v}(\alpha) = \sum_{S_v \in S^*} \frac{\mathrm{ASPE}(S_v, \alpha)}{\binom{n}{n_v}} \tag{10}$$

where $CV_{n_v}(\alpha)$ is calculated for every $\alpha \in R$. Equation (10) serves as an approximation of MSEP($\alpha$) in the situation of finite samples. The CV criterion then is to select the optimal $\alpha$ which gives the minimum value among all $CV_{n_v}(\alpha)$ for $\alpha \in R$. Consequently, the model with the variables indexed by the integers in this $\alpha$ is selected.

The simplest CV, with $n_v = 1$ (leave-one-out), is widely used in chemometrics. However, it has been proven that the model selected by LOO-CV is asymptotically incorrect [11,12]. Although LOO-CV can select a model with bias $\Delta_\alpha = 0$ (Equation (4)) (all non-zero elements included in $b_\alpha$) as $n \to \infty$, it tends to include unnecessary additional variables in the model. This means that the model dimension ($p_\alpha$) is not the most parsimonious and consequently leads to overfitting.

For general CV it has been proven, under the conditions $n_c \to \infty$ and $n_v / n \to 1$ [11], that the probability for cross-validation (with $n_v$ samples left out for validation) to choose the model with the best prediction ability tends to one. In this sense the $CV_{n_v}(\alpha)$ criterion (Equation (10)) is asymptotically consistent. However, the computation of $CV_{n_v}$ with large $n_v$ is not feasible (the computational complexity of $CV_{n_v}$ is exponential). Monte Carlo cross-validation (MCCV) [11] is a simple and effective alternative. For a selected $\alpha$, randomly split the samples into two parts: $S_c(i)$ (of size $n_c$) and $S_v(i)$ (of size $n_v$). Repeat the procedure $N$ times. The repeated MCCV criterion is defined as

$$\mathrm{MCCV}_{n_v}(\alpha) = \frac{1}{N n_v} \sum_{i=1}^{N} \left\| y_{S_v(i)} - \hat{y}_{S_v(i)} \right\|^2 \tag{11}$$

By means of the Monte Carlo method the computational complexity is reduced substantially. Theoretically, the fewer samples used in model calibration, the more repeats are needed; in general, $N = n^2$ is enough to make $\mathrm{MCCV}_{n_v}$ perform as well as $CV_{n_v}$ [17]. It should be pointed out that, after the variables are selected by MCCV, all the samples are used to fit the model to acquire $\hat{b}_\alpha$.
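To make the procedure concrete, the following Python sketch implements the criterion of Equation (11) for least squares subset selection. It is a minimal illustration, not code from the original paper: the function name mccv, the encoding of subsets as tuples of column indices of X (which is assumed to already contain any intercept column) and the default N = 200 are our own choices.

```python
import numpy as np

def mccv(X, y, subsets, n_v, N=200, seed=0):
    """Monte Carlo CV, Equation (11): for each candidate subset of columns,
    average the squared prediction error over N random calibration/validation
    splits, each leaving n_v samples out for validation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Reuse the same N random splits for every subset so the scores are comparable.
    splits = [rng.permutation(n) for _ in range(N)]
    scores = {}
    for alpha in subsets:          # alpha is a tuple of column indices
        sse = 0.0
        for perm in splits:
            val, cal = perm[:n_v], perm[n_v:]
            b, *_ = np.linalg.lstsq(X[np.ix_(cal, alpha)], y[cal], rcond=None)
            sse += np.sum((y[val] - X[np.ix_(val, alpha)] @ b) ** 2)
        scores[alpha] = sse / (N * n_v)
    return scores
```

The subset with the smallest score is selected and, as noted above, the model is then refitted on all n samples.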
With the PLS method there are two approaches to model selection. One is to select a PLS model (determine the number of PLS components for the model) after variable selection; this is hard and time-consuming when the number of variables is large. The other is to build a PLS model without variable selection. In this study the latter approach is used. In this case, selecting a model means determining, by cross-validation, the number of PLS components to be included in the model. For PLS there are in total $q$ possible different models of the form of Equation (7), corresponding to $k = 1, 2, \ldots, q$. The repeated MCCV criterion is defined as

$$\mathrm{MCCV}_{n_v}(k) = \frac{1}{N n_v} \sum_{i=1}^{N} \left\| y_{S_v(i)} - \hat{y}^k_{S_v(i)} \right\|^2 \tag{12}$$

where $\hat{y}^k_{S_v(i)}$ is the predicted response vector using the PLS model (with $k$ components) based on $X_{S_c}$, without variable selection.
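A corresponding sketch of Equation (12), with scikit-learn's PLSRegression standing in for the PLS algorithm of the paper (the helper name and the use of scikit-learn are our assumptions, not the authors' implementation):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def mccv_pls(X, y, max_k, n_v, N=200, seed=0):
    """Equation (12): MCCV score for each number k = 1..max_k of PLS components.
    max_k must not exceed min(n - n_v, number of columns of X)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    press = np.zeros(max_k)        # accumulated squared prediction errors per k
    for _ in range(N):
        perm = rng.permutation(n)
        val, cal = perm[:n_v], perm[n_v:]
        for k in range(1, max_k + 1):
            pls = PLSRegression(n_components=k).fit(X[cal], y[cal])
            press[k - 1] += np.sum((y[val] - pls.predict(X[val]).ravel()) ** 2)
    return press / (N * n_v)       # entry k-1 holds MCCV_nv(k)
```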
2.3. Estimating the MSEP

Several papers in chemometrics have tried to estimate the MSEP from finite samples [19,28–30]. Here we chiefly consider using cross-validation methods to estimate the MSEP. As was pointed out [5], an estimate of the MSEP computed from the observed data tends to underestimate the true MSEP for new future points, since the data have been used twice, both to fit the model and to check its accuracy. The model is selected so as to lie close to the observed points, which is what fitting implies; these points therefore yield an optimistic estimate of the model's true prediction error.

For model (2) the LOO-CV criterion is

$$CV_1(\alpha_1) = \min_{\alpha \in R} \{ CV_1(\alpha) \} \tag{13}$$

where $CV_1(\alpha_1)$ is obtained from Equation (11) with $n_v = 1$ and $\alpha_1$ denotes the optimal model indexed by Equation (13). Overfitting of the model selected by LOO-CV occurs in two ways. On the one hand, too many variables (or PLS components) are included in the model [11]. On the other hand, $CV_1(\alpha_1)$ always overestimates the prediction ability of the selected model, in the sense that $E(CV_1(\alpha_1)) \le \mathrm{MSEP}(\alpha_1)$ [31]. Theoretically, $E(CV_1(\alpha_1) - \mathrm{MSEP}(\alpha_1)) \approx O(n^{-2})$ [15] or, equivalently,

$$E(CV_1(\alpha_1)) \approx \mathrm{MSEP}(\alpha_1) + O(n^{-2}) \tag{14}$$

where $O(n^{-2})$ denotes a quantity satisfying $|O(n^{-2})| \le M/n^2$, $M$ being a positive constant. Equation (14) indicates that, when $CV_1(\alpha_1)$ is used to estimate $\mathrm{MSEP}(\alpha_1)$, the mean squared error of prediction of the selected model is estimated with an accuracy of order $1/n^2$; on average, this is an underestimation of the mean squared error of prediction of the selected model. It should be mentioned that the MSEP depends on the size of the training set: the larger the training set, the smaller the MSEP, as can be seen clearly from Equations (4) and (5).

MCCV can also be used to estimate the prediction error. However, since MCCV uses only $n_c$ samples for calibration, it is inappropriate to use $\mathrm{MCCV}_{n_v}(\alpha_{n_v})$ to estimate the MSEP of the model fitted with $n$ samples (Equation (4)) when $n_v$ is large. Let $\alpha_{n_v}$ denote the optimal model indexed by Equation (11). The expected difference between $\mathrm{MCCV}_{n_v}(\alpha_{n_v})$ and the mean squared error of prediction of the selected model is [15]

$$E(\mathrm{MCCV}_{n_v}(\alpha_{n_v}) - \mathrm{MSEP}(\alpha_{n_v})) \approx O\!\left(\frac{n_v}{n_c n}\right) \tag{15}$$
The accuracy is thus of order $n_v/(n_c n)$. If a large part of the samples is left out for validation, this error is not very small, and $\mathrm{MCCV}_{n_v}(\alpha_{n_v})$ may be a poor estimate of $\mathrm{MSEP}(\alpha_{n_v})$. In order to improve the accuracy of estimation, a correction is applied to $\mathrm{MCCV}_{n_v}(\alpha_{n_v})$:

$$\mathrm{CMCCV}_{n_v}(\alpha_{n_v}) = \mathrm{MCCV}_{n_v}(\alpha_{n_v}) + \frac{1}{n} \left\| y - X_{\alpha_{n_v}} \hat{b}_{\alpha_{n_v}} \right\|^2 - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n} \left\| y - X_{\alpha_{n_v}} \hat{b}_{\alpha_{n_v}, S_c(i)} \right\|^2 \tag{16}$$

where $\hat{b}_{\alpha_{n_v}}$ in the second term is estimated from all $n$ samples and $\hat{b}_{\alpha_{n_v}, S_c(i)}$ in the third term is estimated from the $n_c$ samples in $S_c(i)$ $(i = 1, 2, \ldots, N)$. $\mathrm{MCCV}_{n_v}(\alpha_{n_v})$ reflects the average prediction ability of the model fitted with $n_c$ samples and, as stated above, it overestimates the MSEP of the model fitted with $n$ samples. The second term is the average residual sum of squares of the model fitted with $n$ samples; the third term is the average residual sum of squares and prediction error of the models based on $n_c$ samples. Together, the latter two terms combine the behavior of the model fitted with $n_c$ samples and with $n$ samples. The following result confirms that $\mathrm{CMCCV}_{n_v}(\alpha_{n_v})$ does indeed improve the accuracy of estimation, which is now of order $n_v/(n_c n^2)$:

$$E(\mathrm{CMCCV}_{n_v}(\alpha_{n_v}) - \mathrm{MSEP}(\alpha_{n_v})) \approx O\!\left(\frac{n_v}{n_c n^2}\right) \tag{17}$$

The multifold cross-validation and repeated learning–testing methods [15–17] can be regarded as versions of Monte Carlo cross-validation. Those references, however, did not note the asymptotic consistency of the methods; rather, the authors suggested that the size of the validation set ought to be limited to a very minor part of all samples. Used in that way, the methods share the same deficiency as LOO-CV [32]: they cannot be guaranteed to select the optimal model with the largest probability.

For the PLS model the corrected MCCV is defined as

$$\mathrm{CMCCV}_{n_v}(k_{n_v}) = \mathrm{MCCV}_{n_v}(k_{n_v}) + \frac{1}{n} \left\| y - X \hat{b}_{k_{n_v}} \right\|^2 - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n} \left\| y - X \hat{b}_{k_{n_v}, S_c(i)} \right\|^2 \tag{18}$$

Because the PLS estimator $\hat{b}_{k, S_c}$ is non-linear, we cannot obtain an expression like Equation (17). However, $\mathrm{CMCCV}_{n_v}(k_{n_v})$ is expected to be more accurate than $\mathrm{MCCV}_{n_v}(k_{n_v})$ in estimating the MSEP.
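For the least squares case, the corrected criterion of Equation (16) needs only the per-split coefficient vectors plus one full-data fit. Below is a minimal sketch under the same assumptions as the mccv sketch in Section 2.2 (function name ours):

```python
import numpy as np

def cmccv(X, y, alpha, n_v, N=200, seed=0):
    """Equation (16): corrected MCCV estimate of the MSEP for the
    already-selected tuple of column indices alpha."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xa = X[:, list(alpha)]
    # Second term of Equation (16): average RSS of the model fitted on all n samples.
    b_full, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    rss_full = np.sum((y - Xa @ b_full) ** 2) / n
    mccv_sum, rss_sub = 0.0, 0.0
    for _ in range(N):
        perm = rng.permutation(n)
        val, cal = perm[:n_v], perm[n_v:]
        b_c, *_ = np.linalg.lstsq(Xa[cal], y[cal], rcond=None)
        mccv_sum += np.sum((y[val] - Xa[val] @ b_c) ** 2)
        # Third term: residuals over all n samples of each calibration-set fit.
        rss_sub += np.sum((y - Xa @ b_c) ** 2) / n
    mccv = mccv_sum / (N * n_v)
    return mccv + rss_full - rss_sub / N
```

Because the full-data fit has the smaller residual sum of squares, the correction is typically negative, pulling the pessimistic MCCV value back toward the MSEP of the model fitted on all n samples.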
3. EXPERIMENTAL

Although MCCV can asymptotically select the model with the correct variables, its performance on data sets with a finite number of samples needs further investigation, as does the question of whether CMCCV improves the accuracy with which the prediction ability is estimated. In this section, simulated data, quantitative structure–retention relationship (QSRR) data, near-infrared data and ultraviolet data are used to assess the methods. All data sets were centered and normalized before computation.
3.1. Simulation data

In order to explore the capabilities of MCCV in different situations, a set of simulated data was created for the calibration model. The following model is considered:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + e_i \tag{19}$$

where $i = 1, 2, \ldots, 40$, the $e_i$ are drawn from $N(0, \sigma^2)$, $x_{hi}$ is the $i$th value of the $h$th variable $x_h$, and the values $x_{hi}$ $(h = 1, 2, 3, 4;\ i = 1, 2, \ldots, 40)$ are taken from a uniform distribution on the interval [0, 2] (U[0, 2]). Some components of $\beta$ may be zero. Thus some variables are selected from the four candidate variables $\{x_1, x_2, x_3, x_4\}$, and the model with the best prediction ability is sought. The true model is

$$y_i = 2 + 4 x_{3i} + e_i \tag{20}$$
Two levels of random error are considered: the random errors obey a normal distribution with standard deviation $\sigma = 0.1$ (low level) or $\sigma = 1$ (high level). The size of the validation set is $n_v$ = 15, 20, 25, 30 or 35, and the whole procedure is repeated in 200 independent simulation runs. In order to assess the selected model, another 2000 samples are generated for prediction; these are used to calculate the MSEP both for the models selected by LOO-CV and MCCV and for the true model (TMSEP in Tables II and III).
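A minimal Python sketch of this simulation design, reusing the hypothetical mccv helper from Section 2.2 (all names and the seed are ours):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, sigma = 40, 0.1                               # low noise level; sigma = 1 for the high level
X = rng.uniform(0.0, 2.0, size=(n, 4))           # x1..x4 ~ U[0, 2]
y = 2 + 4 * X[:, 2] + rng.normal(0.0, sigma, n)  # true model (20): y = 2 + 4*x3 + e
X1 = np.column_stack([np.ones(n), X])            # prepend an intercept column

# All 15 non-empty subsets of {x1, ..., x4}, each including the intercept (column 0).
subsets = [(0, *c) for r in range(1, 5) for c in combinations(range(1, 5), r)]
scores = mccv(X1, y, subsets, n_v=25)            # e.g. leave 25 of the 40 samples out
best = min(scores, key=scores.get)               # ideally (0, 3): intercept plus x3
```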
3.2. QSRR data

Prediction of the retention values of organic compounds using QSRRs is of great interest, because it allows one to determine the elution order of molecules in various homologous series when authentic standards are not available. The procedure consists in obtaining a relationship (model) between the retention index and molecular topological indices from a training set and then predicting the chromatographic retention behavior of new compounds. Katritzky et al. [33] have summarized recent studies in this area. Eleven topological indices of alkanes were selected as variables (descriptors): the molecular topological index (MTI) [34,35], the maximum eigenvalue of the distance matrix (MED) [34,35], the determinant of the sum of the distance matrix and the adjacency matrix (DET) proposed by Schultz and co-workers [34,35], the indices Yx [36] and EAID [37] proposed by Xu and co-workers, and the molecular connectivity indices ⁰χ, ¹χ, ²χ, ³χp, ³χc and ⁴χp [38]. The values of these 11 topological indices are listed in Table I. Retention indices of the 70 samples, which make up the vector y, were collected from several different sources [39]. The first 40 samples serve as the data for modeling, and the remaining 30 samples are used as the prediction set; both sets are listed in Table I.

Table I. Training (nos 1–40) and prediction (nos 41–70) sets: 70 alkanes with their retention indices (RI) and the values of the 11 topological indices MTI, MED, DET, Yx, EAID, ⁰χ, ¹χ, ²χ, ³χp, ³χc and ⁴χp (Me, methyl; Eth, ethyl).
3.3. Near-infrared data

The near-infrared data came from Reference [40]. The primary purpose of those data was to compare multiple linear regression and ridge regression methods, and they were also used as an example in References [41–43]. Here we use the data for model selection (or variable selection) for the multiple linear regression model. The response y is the protein percentage in ground wheat. The six variables L1–L6 are measurements of the reflectance of NIR radiation by wheat samples at six different wavelengths, made on a log(1/R) scale, where R is the reflectance. There are in total 50 samples. In Reference [40] the first 24 samples were used to obtain the model and the remaining 26 samples served as the prediction set. In this paper we use the first 24 samples as well as another six samples from the remaining 26 (nos 1, 4, 8, 12, 16 and 20 in Table III of Reference [40]) as the data for modeling. The remaining 20 samples are used as the prediction set.
3.4. Ultraviolet data

These data were also used as an example in References [19,25]. The data consist of ultraviolet measurements of mixtures of naphthalene, anthracene, fluorene and phenanthrene. Thirty-five samples were collected, of which 25 (randomly chosen) are used as the training set and the remaining 10 as the prediction set.
4. RESULTS AND DISCUSSION

4.1. Simulation data

Simulations based on other true models (e.g. different zero components of $\beta$, more levels of random error) were also carried out; the results were much the same. For simplicity, only the simulation based on model (20) is discussed. The results are listed in Tables II and III.

As stated in Section 2, LOO-CV tends to include more variables than necessary in the model, leading to overfitting of the selected model. In practice the true MSEP is not known; overfitting thus means a higher MSEP when new test samples are predicted than that estimated during cross-validation on the training set. In Tables II and III, $n_v = 1$ indicates LOO-CV. It is seen that there is about a 38–47% chance for CV1 to select a larger model; in other words, there is the same chance for the model selected by CV1 to overfit the data. The MSEP based on the model selected by CV1 is greater than the true MSEP by about 5% ((0.0112 − 0.0106)/0.0106 or (1.101 − 1.056)/1.056). For estimating the prediction ability, the optimal CV1 is obviously smaller than the MSEP of the obtained model; the difference between them is about 8–9%.

With increasing size of the validation set, $n_v$, the following is noteworthy from Tables II and III. (1) The chance for MCCV to select the true model increases, so the risk that the model selected by MCCV overfits decreases. (2) One might assume that $n_v = 35$ produces the best result, because the selected models give the smallest MSEP. However, two points should be noted: leaving out 35 samples from 40 leaves very few samples per variable, and the error of CMCCV in estimating the MSEP is then not small enough to be acceptable. (3) The value of MCCV depends greatly on $n_v$. Taking Table II as an example, it is clearly not suitable to use the value of MCCV to estimate the MSEP of the selected model when $n_v > 20$, since the error is greater than 10%. In the cases $n_v = 15$ and 20 it seems that using MCCV to estimate the MSEP of the selected model is more accurate than using CMCCV; however, the prediction performance of the selected model is not good in these cases, because its MSEP is obviously larger than the true MSEP (i.e. the MSEP based on the true model). CMCCV is an appropriate candidate for estimating the prediction ability of the selected model: one can see from Tables II and III that, instead of increasing with $n_v$ as MCCV does, CMCCV varies around the MSEP of the selected model. In the cases $n_v = 25$ and 30, CMCCV estimates the MSEP with satisfactory accuracy, and the MSEP of the selected model is then very close to the true MSEP.
Table II. Simulation results for σ² = 10⁻² (200 simulation runs; all CV and MSEP values ×10⁻²)

       Frequencies of selected variable subsets     Values of optimal CV      MSEP on test set
n_v    {3}    {3,4}   {2,3}   {1,4}   Other         CV1     MCCV    CMCCV     Model by CV1   Model by MCCV
1      106    29      24      21      20            1.02    –       –         1.12           –
15     129    22      18      18      13            –       1.06    1.02      –              1.11
20     145    18      17      17      3             –       1.09    1.03      –              1.09
25     174    9       7       7       3             –       1.18    1.08      –              1.07
30     192    3       3       2       0             –       1.27    1.08      –              1.06
35     200    0       0       0       0             –       1.73    1.09      –              1.05

TMSEP (MSEP of the true model) = 1.06.
Table III. Simulation results for σ² = 1 (200 simulation runs)

       Frequencies of selected variable subsets     Values of optimal CV      MSEP on test set
n_v    {3}    {3,4}   {2,3}   {1,4}   Other         CV1     MCCV    CMCCV     Model by CV1   Model by MCCV
1      124    24      19      19      14            1.018   –       –         1.101          –
15     145    22      16      12      5             –       1.051   1.016     –              1.091
20     151    19      12      11      7             –       1.121   1.059     –              1.090
25     170    12      10      8       5             –       1.128   1.049     –              1.072
30     192    4       2       2       0             –       1.203   1.024     –              1.058
35     199    1       0       0       0             –       1.760   1.090     –              1.048

TMSEP (MSEP of the true model) = 1.056.
4.2. QSRR data

Although there are 11 topological indices available, some of them may have little influence on the retention index (RI). In order to select the best variables for the model, LOO-CV and MCCV are applied to the training set (the first 40 samples in Table I). The results are listed in Table IV.

Table IV. Results for QSRR data

                                      Values of optimal CV          MSEP on test set
n_v    Model variables^a              CV1      MCCV     CMCCV       Model by CV1   Model by MCCV
1      1, 2, 3, 4, 5, 6, 7, 8, 11     37.764   –        –           53.768         –
10     1, 2, 3, 4, 5, 6, 7, 8, 11     –        43.103   39.080      –              53.768
15     1, 2, 4, 5, 8, 9, 10, 11       –        48.129   40.038      –              39.796
20     1, 2, 4, 5, 8, 9, 10, 11       –        53.88    40.682      –              39.796
25     1, 2, 4, 5, 8, 9, 10, 11       –        69.177   43.214      –              39.796

^a Corresponding variables: 1, MTI; 2, MED; 3, DET; 4, Yx; 5, EAID; 6, ⁰χ; 7, ¹χ; 8, ²χ; 9, ³χp; 10, ³χc; 11, ⁴χp.

The optimal CV1 selects nine variables, and the obtained model is

RI = 1062.6 − 6.543 MTI + 178.76 MED + 0.003 DET − 711.27 Yx + 28.799 EAID + 380.53 ⁰χ − 463.26 ¹χ − 206.07 ²χ + 26.151 ⁴χp,
R² = 0.998, s = 5.160    (21)

Leaving out about 50% of the samples (n_v = 15, 20, 25) at a time for validation and performing 200 Monte Carlo splits, the optimal MCCV selects eight variables. The obtained model is

RI = 964.37 − 3.038 MTI + 91.755 MED − 443.99 Yx + 3.799 EAID + 39.352 ²χ − 75.713 ³χp − 28.44 ³χc + 40.937 ⁴χp,
R² = 0.987, s = 5.196    (22)

From the viewpoint of fitting, there seems to be no evident difference between models (21) and (22), although model (21) has a lower s. However, the performances of the two models on the prediction set are different. From Table IV it is seen that the MSEP based on model (21) is 53.768, plainly larger than 39.796, the MSEP based on model (22). This indicates that model (21) has included unnecessary variables: it performs well in fitting but poorly in prediction, which is the typical appearance of overfitting. On the other hand, it is noteworthy that the optimal CV1 obviously underestimates the MSEP of the model selected by it. As stated in the previous subsection, although MCCV can select a better model than CV1, it tends to overestimate the MSEP of the model selected by it (this can also be seen in Table IV). Table IV shows clearly that CMCCV gives an excellent correction of MCCV for estimating the MSEP when n_v > 10. However, when n_v = 25 the result may not be very reliable, since there are very few samples per variable and the error of CMCCV in estimating the MSEP increases a little in this case. All these results are in satisfactory agreement with the simulation results.
4.3. Near-infrared data

The results for these data are listed in Table V.

Table V. Results for near-infrared data

Values of optimal CV                  MSEP on test set
CV1      MCCV     CMCCV               Model by CV1   Model by MCCV
0.043    0.070    0.050               0.066          0.054

First, LOO-CV is performed on the modeling data. The optimal CV1 selects five variables, namely L1, L3, L4, L5 and L6. The obtained model is

y = 25.038 + 0.022 L1 + 0.246 L3 − 0.243 L4 + 0.010 L5 − 0.033 L6,
R² = 0.998, s = 0.197    (23)

We then apply MCCV to the training data set of 30 samples. Leaving out 60% of the samples (n_v = 18) at a time for validation and performing 200 Monte Carlo splits, the optimal MCCV selects four variables, namely L3, L4, L5 and L6. The obtained model is

y = 26.101 + 0.259 L3 − 0.224 L4 + 0.008 L5 − 0.040 L6,
R² = 0.988, s = 0.198    (24)

Comparing models (23) and (24), it seems that model (23) performs a little better, since it has a lower s and a slightly higher R². However, from the viewpoint of prediction ability the conclusion may be the contrary. In order to confirm this, both models are finally used to predict on the prediction set. Figure 1 shows the results. It is seen that the prediction performance of model (24) is better than that of model (23): the MSEP of model (23) on the prediction set is 0.066, visibly larger than 0.054, the MSEP of model (24). This reveals that there may be problems with model (23). Model (23) has five variables, one more than model (24), so its fitting performance is better; however, this extra variable is unnecessary for the model and degrades the prediction ability. In other words, this variable brings overfitting into the model.

As for estimating the prediction ability of the model, the optimal CV1 of model (23) on the training set apparently underestimates the MSEP on the prediction set, because the optimal CV1 is obviously smaller than the MSEP (0.043 < 0.066). For model (24), the optimal MCCV on the training set is 0.070, evidently larger than the MSEP (0.054) on the prediction set. This illustrates that it is improper to use MCCV to estimate the MSEP of the selected model. However, the corrected MCCV (0.050) estimates the MSEP with acceptable accuracy. All these results are again in satisfactory concordance with the simulation results.
4.4. Ultraviolet data

For the spectroscopic data we use the PLS model. Based on LOO-CV, the model should contain five PLS components; the minimum CV1 is 6.498 × 10⁻⁴. The results are shown in Figure 2(a). However, the value of MSEP(k), shown in Figure 2(b), reaches its minimum (7.328 × 10⁻⁴) at k = 4; the MSEP at k = 5 is 8.985 × 10⁻⁴. Thus the adopted model should be the one that contains only four PLS components.
Figure 1. Prediction plot for near-infrared data: crosses, model selected by CV1; circles, model selected by MCCV.
Figure 2. (a) LOO-CV, (b) MSEP and (c) MCCV for ultraviolet data.

The model selected by LOO-CV includes one extra PLS component, and the minimum CV1 overestimates the prediction ability of the selected model. MCCV is performed with N = 200 and n_v = 12. It is seen from Figure 2(c) that the MCCV value reaches its minimum at k = 4. Thus MCCV with about 50% of the samples left out can select the appropriate number of PLS components for the model. However, the minimum MCCV (9.211 × 10⁻⁴) is obviously larger than the minimum MSEP (7.328 × 10⁻⁴). The corrected MCCV at k = 4 is 7.203 × 10⁻⁴, very close to the minimum MSEP. Therefore CMCCV once again gives an excellent correction of MCCV for estimating the MSEP.
5. CONCLUSION

Model selection and assessment of the prediction ability of the selected model are the central tasks in modeling and prediction problems. Since LOO-CV tends to select an unnecessarily large model and to overestimate the prediction ability of that model, there is a need for a method that carries less risk of overfitting and estimates the prediction ability accurately. Monte Carlo cross-validation (MCCV) and corrected Monte Carlo cross-validation (CMCCV) are commendable candidates for this purpose. For the simulated data set and the three
real data sets studied here, the following can be concluded.

1. MCCV has an obviously larger probability than LOO-CV of selecting the correct variables for the model. For the examples in this paper, leaving out about 50% of all samples for validation is recommended. It should be pointed out, however, that the recommended percentage of samples left out for validation might be even higher for larger data sets.

2. The optimal CV1 tends to overestimate the prediction ability of the model selected by it, whereas the optimal MCCV tends to underestimate the prediction ability of the model selected by it when a large percentage of samples is left out for validation. CMCCV gives a substantial improvement over MCCV in estimating the prediction ability of the selected model.
REFERENCES

1. Allen DM. The relationship between variable selection and data augmentation and a method of prediction. Technometrics 1974; 16: 125–127.
2. Stone M. Cross-validatory choice and assessment of statistical predictions (with discussion). J. R. Statist. Soc. B 1974; 36: 111–147.
3. Wahba G, Wold S. A completely automatic French curve. Commun. Statist. 1975; 4: 1–17.
4. Wold S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 1978; 20: 397–405.
5. Efron B. How biased is the apparent error rate of the prediction rule? J. Amer. Statist. Assoc. 1986; 81: 461–470.
6. Næs T. Leverage and influence measures for principal component regression. Chemometrics Intell. Lab. Syst. 1989; 5: 155–168.
7. Höskuldsson A. PLS regression methods. Chemometrics Intell. Lab. Syst. 1996; 32: 37–55.
8. Martens H, Martens M. Multivariate Analysis of Quality: An Introduction. John Wiley & Sons: Chichester, 2001.
9. Unscrambler for Windows, User's Guide. CAMO AS: Trondheim, Norway, 1996.
10. Indahl UG, Næs T. Evaluation of alternative spectral feature extraction methods of textural images for multivariate modelling. J. Chemometrics 1998; 12: 261–278.
11. Shao J. Linear model selection by cross validation. J. Amer. Statist. Assoc. 1993; 88: 486–494.
12. Stone M. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. R. Statist. Soc. B 1977; 39: 44–47.
13. Geisser S. The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 1975; 70: 320–328.
14. Herzberg G, Tsukanov S. A note on modification of the jackknife criterion for model selection. Utilitas Mathematica 1986; 29: 209–216.
15. Burman P. A comparative study of ordinary cross-validation, v-fold cross-validation and repeated learning–testing methods. Biometrika 1989; 76: 503–514.
16. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth: Belmont, CA, 1984.
17. Zhang P. Model selection via multifold cross validation. Ann. Statist. 1993; 21: 299–313.
18. Picard RR, Cook RD. Cross-validation of regression models. J. Amer. Statist. Assoc. 1984; 79: 575–583.
19. Xu QS, Liang YZ. Monte Carlo cross validation. Chemometrics Intell. Lab. Syst. 2001; 56: 1–11.
20. Faber K, Kowalski BR. Propagation of measurement errors for the validation of predictions obtained by principal component regression and partial least squares. J. Chemometrics 1997; 11: 181–238.
21. Martens HA, Dardenne P. Validation and verification of regression in small data sets. Chemometrics Intell. Lab. Syst. 1998; 44: 91–121.
22. Martens H, Næs T. Multivariate Calibration. Wiley: Chichester, 1989.
23. Brown PJ. Measurement, Regression and Calibration. Clarendon Press: Oxford, 1993.
24. Wold S. Discussion: PLS in chemical practice. Technometrics 1993; 35: 137–139.
25. Xu Q-S, Liang Y-Z, Shen H-L. Generalized PLS regression. J. Chemometrics 2001; 15: 135–148.
26. Akaike H. A new look at the statistical model identification. IEEE Trans. Automat. Control 1974; 19: 716–723.
27. Mallows CL. Some comments on Cp. Technometrics 1973; 15: 661–675.
28. Lorber A, Kowalski BR. Estimation of prediction error for multivariate calibration. J. Chemometrics 1988; 2: 93–109.
29. De Vries S, Ter Braak CJF. Prediction error in partial least squares regression: a critique on the deviation used in The Unscrambler. Chemometrics Intell. Lab. Syst. 1995; 30: 239–245.
30. Denham MC. Choosing the number of factors in partial least squares regression: estimating and minimizing the mean squared error of prediction. J. Chemometrics 2000; 14: 351–361.
31. Hjorth U. Computer Intensive Statistical Methods: Validation, Model Selection and Bootstrap. Chapman & Hall: London, 1994.
32. Racine J. A consistent cross-validatory method for dependent data: hv-block cross-validation. J. Econometrics 2000; 99: 39–61.
33. Katritzky AR, Chen K, Maran U, Carlson DA. QSPR correlation and predictions of GC retention indexes for methyl-branched hydrocarbons produced by insects. Anal. Chem. 2000; 72: 101–109.
34. Schultz HP. Topological organic chemistry. 1. Graph theory and topological indices of alkanes. J. Chem. Inf. Comput. Sci. 1989; 29: 227–228.
35. Schultz HP, Schultz EB, Schultz TP. Topological organic chemistry. 2. Graph theory, matrix determinants and eigenvalues, and topological indexes of alkanes. J. Chem. Inf. Comput. Sci. 1990; 30: 27–29.
36. Yao YY, Xu L, Yuan XS. A new topological index for research on structure–property relationships of alkanes. Acta Chimica Sinica 1993; 51: 463–469.
37. Hu CY, Xu L. On highly discriminating molecular topological index. J. Chem. Inf. Comput. Sci. 1996; 36: 82–90.
38. Kier LB, Hall LH. Molecular Connectivity in Chemistry and Drug Research. Academic Press: New York, 1976.
39. Du YP, Liang YZ, Wu CJ. Database construction of GC retention indices and correction of mistakes in it. In Proceedings of the 8th Chinese Conference on Computers and Applied Chemistry, Huangshan, 2001; 147–149.
40. Fearn T. A misuse of ridge regression in the calibration of a near infrared reflectance instrument. J. Appl. Statist. 1983; 32: 73–79.
41. Næs T, Irgens C. Comparison of linear statistical methods for calibration of NIR instruments. J. Appl. Statist. 1986; 35: 195–206.
42. Hoerl AE, Kennard RW, Hoerl RW. A practical use of ridge regression: a challenge met. J. Appl. Statist. 1985; 34: 114–120.
43. Stone M, Brooks RJ. Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. J. R. Statist. Soc. B 1990; 52: 237–269.