
Chemometrics and Intelligent Laboratory Systems 93 (2008) 1–10, www.elsevier.com/locate/chemolab

Cross model validation and optimisation of bilinear regression models

Lars Gidskehaug ⁎, Endre Anderssen, Bjørn K. Alsberg

Chemometrics and Bioinformatics Group, Department of Chemistry, Norwegian University of Science and Technology, N-7491 Trondheim, Norway

Received 21 December 2006; received in revised form 14 January 2008; accepted 21 January 2008. Available online 9 February 2008.

Abstract

Whenever regression models are optimised, it is important that all optimisation steps are properly validated. Variable selection is one example of parameter estimation that will give overly optimistic models if not included in the validation. There are many examples of reported work where the validation is performed after variable selection, and many have correctly noted that these models are optimistically biased. However, if the availability of samples is limited, separation of the data into a training and a validation set may decrease the quality of both the calibration model and the validation. Cross model validation is designed to validate the optimisation by including the variable selection in an extra layer of cross-validation. This means that all available samples are utilised both in the training and for estimating the residual error of the model. Cross model validation poses challenging questions both conceptually and algorithmically, and a presentation of the full work-flow is needed. We present a complete framework including optimisation, validation and calibration of bilinear regression models with variable selection. Several issues are addressed that are important for each separate stage of the analysis, and suggestions for improvements are proposed. The method is validated on a gene expression data set with a low signal-to-noise ratio and a small number of samples. It is shown that many replicates are needed to model these data properly, and that cross model validated variable selection improves both the final calibration model and the associated error estimates. A Matlab toolbox (Mathworks Inc, USA) is available from www.specmod.org.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Cross model validation; Partial least squares regression; PLSR; Variable selection; Backward elimination; Microarray data

⁎ Corresponding author. Current address: CIGENE, N-1432 Ås, Norway. E-mail address: [email protected] (L. Gidskehaug). doi:10.1016/j.chemolab.2008.01.005

1. Introduction

Chemometricians are used to data where very many measurements are available for each sample, such as those produced by various spectroscopic methods. Even though such data may contain thousands of variables, the variation of interest is often found in a much smaller number of spectral peaks. These areas of highly correlated, neighbouring wavelengths may be interpreted in terms of the spectra of the pure analytes [1]. More recent applications in genomics and proteomics may produce data tables of similar shape to the traditional chemometric data, however without the highly correlated structure of spectra. For instance, tens of thousands of expression levels may be measured in parallel on DNA microarrays [2–4], where neighbouring spots are not related in any metabolic or regulatory sense. It is then of interest to find a subset of genes differentially expressed across the conditions of interest. This can enhance the understanding of the biology involved, and relevant genetic markers can be found.

There are many sources of error in the experimental procedures, and the biological repeatability between test subjects or animals is usually low. These sources of variation lead to data which often have a very low signal-to-noise ratio, and the detection of differentially expressed genes becomes a difficult task. Variable selection is a valuable aid both in the analysis and in the interpretation of microarray data. It is, however, of great importance that the analysis is properly validated.

Building a bilinear regression model in which variable selection is included poses other challenges than if no selection of variables is performed. The most prominent difference is that an ordinary one-layer cross-validation will give overly optimistic results after variable selection [5–8]. This is the background for an ongoing controversy in the QSAR field. It is claimed in Ref. [5] that an external validation set is necessary for obtaining honest validation results after parameter estimation. This claim is countered by at least two independent groups [6,7], which show that good estimates of the prediction error are obtained when the parameter estimation is included in the cross-validation.

In a variable selection problem, the variables minimising the prediction error are chosen. This introduces a downward selection bias in the resulting estimate of the residual error. An independent validation set will give an unbiased estimate of the predictability of the model; however, this is not a feasible solution if the total number of samples is small. Such is the case for microarray data, where practical, economical or even ethical considerations often limit the number of experiments performed. The low signal-to-noise ratio of these data, on the other hand, demands a large training set in order to find a good calibration model. Cross-validation is designed to utilise all samples both for training and for testing. By adding an extra cross-validation loop external to the variable selection, estimates are found for the prediction error which are not biased in an optimistic way [6–11]. This double-layer cross-validation is referred to as cross model validation (CMV). Anderssen et al. [6] introduce CMV for bilinear models and show that the estimated prediction error is comparable to the error from an independent validation set. They suggest that a backward elimination of variables will result in better models than models in which variables are removed in a single step. This methodology will be verified and followed up here.

Many papers stress the importance of validation, as a model cannot be trusted unless it is predictive for future samples. The actual model building is often given less attention. However, it may be very difficult to obtain good calibration models for data with few replicates and a low signal-to-noise ratio. A full overview of the cross model validation including the model building is therefore presented. The analysis is divided into three parts: i) optimisation, ii) validation and iii) calibration. We discuss the different objectives of each step and specifically suggest a set of recommendations that will generate properly validated calibration models with better predictive ability. The presented improvements are validated on data from the microarray literature, as such data are inherently noisy and well suited to illustrate the advantages of CMV.

We have chosen to present an optimisation and validation scheme which is tailored for partial least squares regression (PLSR) [12]; however, most of the principles of CMV are directly transferable to other methods for regression and classification. Variable selection is performed based on jack-knife of regression coefficients with a modified version of the standard uncertainty test [12–14]. We use the term "model size" to describe the number of independent variables included in the model building, and "model rank" to denote the number of components used. The abbreviation "PC" is used to denote PLSR components. Upper-case bold denotes matrices and lower-case bold denotes column-vectors. A Matlab toolbox (Mathworks Inc, USA) containing all the code used in this work is freely available from www.specmod.org.

2. Theory and methods

2.1. Overview of method

A flow chart of the CMV is given in Fig. 1. A regressor matrix X of size N by P contains the P independent variables for N objects. An N by L matrix Y holds the L dependent variables, which may correspond to continuous responses or some categorical class-information. The objects in X and Y are divided into a total of M ∈ {3,4,…,N} segments of similar size. This segmentation is utilised both for the outer and the inner cross-validation loops. For instance, if M = 3, one third of the objects are kept out in the outer loop, and a split-half cross-validation with jack-knife is performed with the remaining objects in the inner loop.

Fig. 1. Flow chart of the CMV. Square boxes are input and output parameters, diamonds are for-loops and rounded boxes are instructions. The indicated loop across segments is the outer cross-validation loop; the loop across model sizes corresponds to an optional backward elimination of variables. Critically, any variable selection is performed within the outer loop.

As jack-knife estimates the uncertainty of model parameters based on their stability in the cross-validation, a value of M = 3 would in practice yield very imprecise results. At the other end of the scale, a leave-one-out cross-validation in both loops is performed when M = N. This yields more degrees of freedom to estimate the parameter uncertainty; however, the error estimates from such a validation tend to be slightly underestimated [15]. In the outer validation loop, each segment m ∈ {1,2,…,M} is in turn removed from the input matrices to give X_−m and Y_−m, leaving X_m and Y_m for testing. The subscript −m is in general used to denote arrays based on or resulting from the set of all objects not in m. The subscript m similarly indicates correspondence with segment m.

The jack-knife [12,14] assigns significance scores to all X-variables based on the stability of PLSR regression coefficients across the inner cross-validation loops. Traditionally, a t-like test is used, and we present a modified T²-test below which accounts for several Y-variables simultaneously. An ordinary significance level α ∈ (0,1) may thus be used to assign significance to the X-variables. This is the standard approach; however, it has two drawbacks. First, it may be difficult to establish the optimal value of α due to multiple testing and dependencies between tests. Second, if the number of X-variables is large, there is an increased risk that some of the relevant Y-variation is left unmodelled by PLSR [16,17]. If a model which inadequately describes Y is used for jack-knife, the significant variables will not necessarily be relevant for Y. Both these drawbacks are addressed by a backward elimination of variables within the outer CMV-loop. A series of pre-defined cutoffs k ∈ {1,2,…,(P − 1)} indicates the number of variables to include in each step.

Recommendation 1. Use restrictions on model size rather than an α-level in the significance testing. A backward elimination of variables can then be performed by defining several model size cutoffs in a decreasing manner.

A total of D ∈ {1,2,…,(P − 1)} cutoffs are defined such that k_d > k_{d+1}, where d ∈ {1,2,…,(D − 1)}. It follows that D obtains its maximum value if one variable at a time is removed until a single variable remains. This is rarely a practical solution when the number of variables P is large, and in most cases D ≪ P. The use of sequential variable selection will yield gradually better model fits as more of the irrelevant variables are removed. This will in turn yield better and more relevant jack-knife estimates. The need for multiple testing adjustments is also circumvented. Because each sub-model is validated, the model size with the smallest prediction error can be chosen instead. In order for the sub-models to be comparable, the same set of model sizes must be used for each segment m. The k variables chosen in each step, however, vary freely between segments. A minimal code sketch of this double-loop work-flow is given below.
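
The following is a minimal sketch of the double-loop work-flow in Python, with scikit-learn's PLSRegression standing in for the Matlab toolbox described above. The rank is here fixed rather than optimised as in Section 2.2, the jack-knife score is a crude stand-in for the T²-test of Appendix A, and all function and parameter names (cmv, jackknife_scores, cutoffs) are ours for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def jackknife_scores(X, Y, rank, n_inner):
    """Stability of the PLSR regression coefficients across the inner
    cross-validation segments; returns one score per variable."""
    B = []
    for tr, _ in KFold(n_splits=n_inner).split(X):
        c = PLSRegression(n_components=rank).fit(X[tr], Y[tr]).coef_
        if c.shape[0] != X.shape[1]:  # coef_ orientation differs between
            c = c.T                   # scikit-learn versions
        B.append(c)                   # (P, L) coefficients per inner segment
    B = np.stack(B)
    return (B.mean(0) ** 2 / (B.var(0) + 1e-12)).sum(1)  # large = stable, non-zero

def cmv(X, Y, M, cutoffs, rank):
    """Cross model validation: the backward elimination is repeated inside
    every outer segment, so the left-out objects never influence which
    variables are selected."""
    Y_cmv = np.zeros(Y.shape)
    for tr, te in KFold(n_splits=M).split(X):       # outer validation loop
        keep = np.arange(X.shape[1])
        for k in cutoffs:                           # decreasing model sizes
            s = jackknife_scores(X[tr][:, keep], Y[tr], rank, M - 1)
            keep = keep[np.argsort(s)[-k:]]         # retain the k most significant
        pls = PLSRegression(n_components=rank).fit(X[tr][:, keep], Y[tr])
        Y_cmv[te] = pls.predict(X[te][:, keep])     # predictions unbiased by selection
    return Y_cmv
```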

2.2. Optimisation

A PLSR-model is calculated based on X_−m and Y_−m, and cross-validated as explained in Ref. [12]. This inner cross-validation is used to find the optimal number of components A_−m^k, the mean squared error of prediction, and the regression coefficients for each inner segment and each model size. Special considerations should be taken in the estimation of A_−m^k when there is more than one Y-variable; this is explained in detail below. The residual error for the model of rank A_−m^k is kept in the vector msep^o_−m, which will also hold the residual errors for all other model sizes. This vector may also include the initial variance, i.e. the residual error after zero components, in order to calculate percentage errors. The superscript o is a reminder that only the error of the model with optimal rank is included at each backward elimination step. Jack-knife of the regression coefficients is performed similarly to Refs. [12,14], however with certain modifications explained in detail below. A new PLSR-model is calculated based on X_−m^k, which holds only the k most significant variables. This optimisation is repeated D times, in each step removing a new set of variables from X_−m^k. The optimal size K of a sub-model is the one corresponding to the minimum value of msep^o_−m. Note that K varies freely between segments; however, the subscript −m is dropped for simplicity of notation. Output from the backward elimination loop is thus:

• The optimal sub-model size K (= K_−m)
• The optimal sub-model rank A_−m^K
• The list of K significant variables varindex_−m
• The regression coefficients B_−m^K corresponding to the optimal sub-model rank and size
• The vector of optimal residual cross-validated errors msep^o_−m for each sub-model size

These outputs may be gathered per segment in a small container, as in the sketch below.
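
For concreteness, a container like the following could hold the backward-elimination output for one segment; the field names mirror the list above but are otherwise our own.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SubModel:
    """Result of the backward elimination for one outer segment m."""
    K: int                  # optimal sub-model size
    A: int                  # optimal sub-model rank A_-m^K
    varindex: np.ndarray    # indices of the K significant variables
    B: np.ndarray           # coefficients B_-m^K for the optimal rank and size
    msep_o: np.ndarray      # optimal cross-validated error per sub-model size
```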

Fig. 2. Illustration of the residual variances for a model with two Y-variables. The responses y1 and y2 are best explained by PC1 and PC4, respectively. Four components should therefore be included in the optimisation step, even if the mean of variances (dashed line) indicates that two components is the optimal choice.

It is generally advised that the optimal number of components is found by requiring that each included component leads to a significant decrease in the residual error of the model [12]. This principle is a safeguard against overfit, and it ensures that the resulting models are more easily interpreted. When several Y-variables are included, the mean residual error is often used to test for predictive ability. However, the criterion of parsimony is not always adequate during the optimisation stage of a CMV. An example is given in Fig. 2, where the residual error for two Y-variables is plotted along with the mean values. The optimal rank based on the overall predictive ability is two; however, the dependent variable y2 is best explained by the fourth component. A model based on two components will only find variables that explain y1, and the other response will be disregarded.

Recommendation 2. During the optimisation stage of a CMV, a model rank should be chosen that ensures modelling of each individual Y-variable. Later, during the calibration stage, the rank must be conservatively chosen to avoid overfit.

We propose that the optimal number of components for each CMV-segment and model size is found as A_−m^k = max(A_mean, A_all). The estimate A_mean is the optimal rank based on the mean of the error variances across Y-variables (for the current segment and model size). This value may be found by requiring that each added component must lead to a certain improvement in residual error, measured for instance in standard deviations or percent. The estimate A_all is the highest among the components that best describe each of the Y-variables. The concern for overfit is not as important here as in the final calibration, as the optimal sub-model will later be validated on left-out data. A sketch of this rank-selection rule is given below.
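
As an illustration, the rule A_−m^k = max(A_mean, A_all) can be computed from a table of inner cross-validated errors as sketched here; msep is assumed to hold one row per rank (row 0 corresponding to zero components) and one column per Y-variable, and the tolerance is our own illustrative choice.

```python
import numpy as np

def optimisation_rank(msep, tol=0.01):
    """A_-m^k = max(A_mean, A_all). msep: (A_max + 1, L) residual errors,
    where row a is the error after a components."""
    mean = msep.mean(axis=1)
    A_mean = 0
    while A_mean + 1 < len(mean) and mean[A_mean + 1] < (1 - tol) * mean[A_mean]:
        A_mean += 1                            # add PCs while they clearly help
    A_all = int(msep.argmin(axis=0).max())     # best rank of the worst-served response
    return max(A_mean, A_all)
```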

The jack-knife method presented in Refs. [12,14] is widely used for cross-validated, bilinear models. A specially tailored one-sample t-test is used to test if the model parameters are zero, and significance is assigned to the variables with non-zero parameters. In this work, the regression coefficients are used to test for significance; however, the presented methodology is similar if loadings or loading-weights are used [16]. A problem occurs with the standard uncertainty t-test when several responses are modelled, because there will be a vector of regression coefficients for each of the L responses. Each of them is tested separately for significance, which leads to a total of L p-values for each variable in X. The user is left to convert these statistics into a single p-value that reflects the significance of the variable. One criterion may be that a variable must come out significant in one or all of the tests in order to be selected. If the regression coefficients for two responses are plotted in a scatter plot, this criterion will favour variables outside a square pattern around the origin. This is illustrated in Fig. 3a. If the geometric mean of the p-values is used instead, this will favour variables outside a convex diamond pattern around the origin, as shown in Fig. 3b. For illustration purposes, the variances of the parameters have been disregarded for these plots. It is seen from Fig. 3a and b that artifacts are introduced when several responses are tested with the standard uncertainty test. A modified version of Hotelling's T²-test that is able to test loadings in several components simultaneously has been previously described [16].

Fig. 3. Illustration of the significant variables found based on four different significance tests on arbitrary data. The regression coefficients b1 and b2 for a model with two Y-variables are plotted in scatter plots. a) Univariate t-tests are performed for each response and the most significant outcome is used. b) Univariate t-tests where the geometric means over the p-values are used. c) Hotelling's T2-test that tests significance for both responses simultaneously. d) Same test as in c), but without shrinkage of the regression coefficients. For a), b) and c), the variances of the variables are disregarded for illustration purposes.

The same methodology can also be used for testing of multiple regression coefficients. The test favours variables outside an elliptic shape that spans the direction of maximum covariance around the origin. An illustration is given in Fig. 3c.

Recommendation 3. An uncertainty test based on Hotelling's T²-statistic has been developed for jack-knife of regression coefficients from bilinear models. This test should be preferred for analysis of regression models with more than one Y-variable.

In the modified jack-knife, each variable is tested under the null hypothesis that all the corresponding regression coefficients are zero. Variables for which the null hypothesis is rejected are declared significant. Formulae for the modified jack-knife are given in Appendix A. When a single response is modelled, the modified test yields the same results as the standard uncertainty test [12]. When P ≫ N and the signal-to-noise ratio is low, the variance estimates for many of the regression coefficients will be imprecise. This is manifested when variables close to the origin falsely come out as significant because they appear to be stable, and when points far from the origin are overlooked because of overestimated variances. An illustration of this phenomenon is given in Fig. 3d. The number of false outcomes may be reduced by weighting the variances toward a common variance estimate based on all variables. This is recommended in Refs. [3,18] and is referred to as shrinkage.

Recommendation 4. Shrinkage of the individual variance estimates toward a variance estimate across variables should be considered in the jack-knife when the signal-to-noise ratio is low.

Calculation of the T²-statistic involves inversion of a variance–covariance matrix C of size L by L for each variable (see Appendix A). If the responses in Y are many and correlated, or linearly dependent such as for discriminant PLSR [12], results may be inaccurate due to inversion problems. Also, testing the same hypothesis on many highly correlated regression coefficients is redundant. In cases where this may pose a problem, we recommend orthogonalising Y prior to the analysis, for instance with a singular value decomposition. The predicted responses can later be expanded back to their original interpretation by multiplication with the right singular vectors. A sketch of the shrunken T²-test is given below.
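
The following is a sketch of the shrunken T²-test (Recommendations 3 and 4). It follows the formulae in Appendix A with b_i taken as the mean over segments and C^total as the mean over the variable-specific covariances; the scaling g and the conversion of scores to p-values are omitted for brevity.

```python
import numpy as np

def hotelling_t2(B_cv, beta=0.1):
    """Modified jack-knife: one T^2 score per variable.
    B_cv: (M, P, L) regression coefficients from the M inner segments."""
    M, P, L = B_cv.shape
    b = B_cv.mean(axis=0)                        # estimated coefficients, (P, L)
    d = B_cv - b                                 # deviations per segment
    C = np.einsum('mpi,mpj->pij', d, d)          # (P, L, L) specific covariances, cf. (A.3)
    C = (1 - beta) * C + beta * C.mean(axis=0)   # shrinkage toward the total, cf. (A.2)
    return np.array([b[p] @ np.linalg.solve(C[p], b[p]) for p in range(P)])  # (A.1)

# Linearly dependent responses (e.g. discriminant PLSR) may be orthogonalised
# first: U, s, Vt = np.linalg.svd(Y, full_matrices=False); regress on U * s and
# map predictions back via  Yhat = Yhat_orth @ Vt.
```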

2.3. Validation

The concept of validation and its importance for obtaining good estimates of residual errors has been established previously [6–8,12]. For each optimised sub-model in the cross-validation, the left-out sample(s) Ŷ_m is predicted for all model ranks a ∈ {1,2,…,A_max} based on X_m^K and B_−m^K. Only the variables indicated by varindex_−m are included in the test sample(s) X_m^K. The predicted values from the model with rank A_−m^K are called Ŷ_m^CMV and are assigned to a separate matrix Ŷ^CMV. The sole purpose of Ŷ^CMV is to provide unbiased error estimates; this is explored in detail in Ref. [6]. Finally, the vector of optimal residual errors corresponding to each model size is assigned to a matrix MSEP^o = (msep^o_−1, msep^o_−2, …, msep^o_−M)^T of size M by (D + 2).

The objects in Ŷ^CMV corresponding to different segments m result from models independently optimised with respect to rank, size and selected variables. It follows that this matrix may be used to validate the complete optimisation process, variable selection included. Error estimates such as the cross-validated correlation coefficient q² are found from Ŷ^CMV. It is shown in Ref. [6] that q² is as conservative as an estimate based on an independent validation set. If Y consists of categorical class-information, Ŷ^CMV can be used to estimate any measure of expected classification performance. For instance, the classification correctness is the fraction of samples in a class which is correctly assigned to that class. Another useful measure is the purity, which is the ratio of classified samples which are correct. If the variable selection had not been included in the outer cross-validation loop, any overfit would have gone unnoticed and the residual error would only represent the training data. The sketch below illustrates how such error estimates are computed from Ŷ^CMV.
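
As an illustration, the error measures named above can be computed from Ŷ^CMV as follows; the 0.5 assignment threshold matches the one used in Section 3, Y is assumed to hold 0/1 class indicators, and the function names are ours.

```python
import numpy as np

def q2(Y, Y_cmv):
    """Cross model validated correlation coefficient from the CMV predictions."""
    return 1 - ((Y - Y_cmv) ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum()

def correctness_and_purity(Y, Y_cmv, threshold=0.5):
    """Per-class correctness (fraction of a class correctly assigned) and
    purity (fraction of the assignments to a class that are correct)."""
    assigned = Y_cmv > threshold
    member = Y > 0.5
    hits = (assigned & member).sum(axis=0)
    return hits / member.sum(axis=0), hits / assigned.sum(axis=0)
```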

2.4. Calibration

Based on the CMV output, the optimal model size K_opt and the optimal model rank A_opt are estimated as described below. A calibration model is then trained based on all N samples, with the restriction that the final model is of rank A_opt and of size K_opt. The calibration stage may thus be regarded as a restricted optimisation stage based on all data. The main goal of the calibration is to train a model with the best possible ability to predict future, unknown samples. It is a question of using as much as possible of the available data while minimising the risk of overfit.

A comparison with the process of building calibration models without variable selection may be illustrative. Consider an arbitrary cross-validated model with known residual errors. The cross-validated error is given by the msep, while the estimated error based on all data is denoted msec. Two such error variances are plotted in Fig. 4. The optimal msep is found for two components, whereas five components give the optimal msec. Common practice is to use A_opt = 2 based on this plot, because five components result in overfit. The validation results are in other words used to choose the best parameters for future predictions. This is also discussed in Ref. [15].

Fig. 4. A cross-validated residual error msep and a corresponding error variance based on all data msec. Five components yield a good fit to the training data, but the predictive ability, as indicated by an asterisk, is not so good. The best predictive ability is found for two components, as indicated by the validation results.

The same principle applies also to calibration models which have been optimised with variable selection. However, it was shown in Section 2.3 that the variable selection itself must be included in the cross-validation in order to get reliable error estimates. A second parameter optimisation loop based on all objects would hence tend to overfit, like the msec in Fig. 4. The solution is to use Ŷ^CMV from the validation step to find a more conservative error estimate: the mean squared error of cross model validation, MSECMV. This matrix is of size L by (A_max + 1), where the first column holds the residual errors corresponding to zero components. The mean over Y-variables is a good estimate of the predictive ability of models of different rank.

Recommendation 5. The optimal rank A_opt in a calibration model after variable selection should be found based on the error residuals from the CMV, as these estimates are not biased by the variable selection.

According to Recommendation 5, the validation results are used to find the optimal calibration model. Strictly speaking, this means that selection bias may be introduced when setting A_opt, as the calibration model itself will not be tested on an independent validation set. However, as illustrated in Fig. 4, the use of validation results to find the optimal number of components is common practice. If the model rank is chosen parsimoniously, the risk of overfit is small. In accordance with Recommendation 2, no extra component should be added to the calibration model unless it causes a significant decrease in the overall residual error.

A less straightforward task is to select variables for the calibration model. Following the logic of Recommendation 5, the CMV results should be used also for this task. However, because all sub-models from the validation consist of different sets of variables, a jack-knife analysis of the regression coefficients from the CMV would have to be performed on a sparse matrix. Replacement of the missing values by zeros would favour variables more often chosen, but the resulting p-values would lose their statistical interpretation. Also, the sub-models would have to be made comparable prior to significance testing, as a model based on many variables tends to have smaller regression coefficients than a model with few variables. Despite such drawbacks, an approach like this might be shown to have merit if properly developed and tested. Another possible approach is to select variables according to the rate at which they are declared significant in the validation. This would imply that variables which often come out as significant in the validation are more important than rarely selected variables. Also for this solution there is a possible problem of incomparable sub-models. Because the regression coefficients are based on multivariate regression models, all the significant variables from a sub-model may be needed to get a good model. Even if the same variable is significant in two different sub-models, there is a risk that different effects are modelled in the two cases. Even between two models that only differ in rank, the same significant variable may contribute in a completely different way. A second point is that the optimal rate would not be known from the validation. However, Westad et al. [19] report successful use of such a significance assessment in combination with permutation testing. A sketch of how A_opt, and the model size K_opt introduced below, may be chosen from the CMV output follows.
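
The following sketch chooses A_opt parsimoniously from the mean CMV error curve and, anticipating Recommendation 6 below, K_opt from the segment-mean of MSEP^o; the tolerance parameter is our own illustrative choice.

```python
import numpy as np

def calibration_rank(msecmv, tol=0.01):
    """A_opt from the CMV residuals (Recommendation 5): add a component only
    while it reduces the mean error over Y-variables by at least tol.
    msecmv: (L, A_max + 1), first column = zero components."""
    mean = msecmv.mean(axis=0)
    a = 0
    while a + 1 < len(mean) and mean[a + 1] < (1 - tol) * mean[a]:
        a += 1
    return a

def calibration_size(msep_o, sizes):
    """K_opt (Recommendation 6, below): the tested model size whose
    segment-mean validated error is smallest. msep_o: (M, D) matrix."""
    return sizes[int(msep_o.mean(axis=0).argmin())]
```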

The solution in Ref. [6] is to optimise a new model based on all objects; however, this may give overly optimistic results as discussed above. We propose instead to estimate an optimal model size K_opt based on the matrix MSEP^o from the validation. Each element corresponds to the optimal residual error for a certain model size and a certain segment. The mean over segments is thus a validated measure of the relative predictive ability between model sizes.

Recommendation 6. If several model sizes have been tested, the optimal size K_opt for a calibration model is the number of variables that results in the smallest mean residual error from the validation. The curve segment spanning the minimum should preferably be robust with respect to the number of variables.

The backward elimination of variables is performed for all data with the same settings as in the optimisation. The elimination is halted when the optimal set of K_opt variables is reached, for which a calibration model is calculated based on the optimal number of components A_opt. If D = 1, for instance if a significance level is used to select variables, the same cutoff is used in the calibration as in the training.

2.5. Data

Genetic responses to different treatments for leukaemia have been measured in vivo [2]. Cell samples were obtained from 60 patients prior to treatment and 24 h after treatment, and gene expressions were measured for all N = 120 samples on HG-U95A oligonucleotide arrays (Affymetrix, Inc., USA). The data can be obtained from Ref. [20], with the accession number GDS330. A total of P = 12 533 probe sets, including 9670 genes, are arranged in the data table X. The matrix is log-transformed and each row (microarray) is normalised to zero mean. It is noted in passing that neither the normalisation nor any previous data preparation steps need be explicitly validated by the CMV, as these are pre-processing steps not guided by the treatment information. The 60 pre-treatment samples constitute a Control group. The four different treatments include MP (mercaptopurine, 12 samples), MT (methotrexate, 22 samples), MPL (MP and low dose MT, 16 samples) and MPH (MP and high dose MT, 10 samples). Categorical indexes to all L = 5 groups are arranged in the matrix Y, with ones denoting membership to one of the groups.

3. Results

A leave-one-patient-out cross-validation is performed both in the outer and inner loops, by holding out the control and treatment samples of one patient at a time; a sketch of such a patient-wise split is given below. The resulting residual error reflects the ability of the model to predict a genetic response to treatment for future patients. To stabilise the jack-knife, all the gene-specific covariances are shrunk with a contribution of β = 0.1 times the mean over covariances. A stepwise elimination of variables is performed for all k ∈ {12 000, 11 500, …, 1500, 1200, 1100, …, 100}. The percent residual variances for a subset of the model sizes are given in the box-and-whisker plot in Fig. 5, with the mean(MSEP^o) given as a solid line.
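
Such a patient-wise segmentation can be expressed with scikit-learn's LeaveOneGroupOut, assuming a vector patient_id labelling which of the 60 patients each of the 120 rows belongs to; the variable and function names are ours.

```python
from sklearn.model_selection import LeaveOneGroupOut

def patient_segments(X, Y, patient_id):
    """Yield outer CMV segments that hold out the control and the treated
    sample of one patient at a time, so no patient is in both sets."""
    for train, test in LeaveOneGroupOut().split(X, Y, groups=patient_id):
        yield train, test
```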

Fig. 5. The percent residual MSEP^o for a subset of the model sizes k. The boxes are not true confidence intervals; however, they include the data values that fall between the lower and upper quartiles for each model size. The median is marked with a horizontal line within each box, and the whiskers extend to data points no more than 1.5 times the box-length. The solid line is the mean(MSEP^o), and the minimum value at K_opt = 900 variables is indicated with a black dot.

Fig. 7. The MSECMV for the three groups Control, MT and the pooled MP-group, plotted with solid lines from bottom to top. The mean error is given by the dashed line, and the optimal number of components A_opt = 3 is indicated by an asterisk.

A model size of K_opt = 900 variables is found to give the best overall predictability and is chosen for the calibration model. It is useful to calibrate a model based on all variables for comparison. The percent residual cross-validated variances for such a model are given in Fig. 6a. The Y-variables have been orthogonalised to avoid inversion problems in the jack-knife, and the solid lines represent the errors for each of the orthogonal responses. The mean over variances is given by the dashed line, and the optimal model rank estimated from the mean is A_mean = 4. This corresponds to a residual error of 62%, and the point is indicated by an asterisk. One of the responses is well explained by the first component, another is well explained by the third component, while the last two responses are best explained by the seventh component. The optimal number of components based on the criterion that all responses must be explained is thus A_all = 7, as indicated by the circle.

Fig. 6. Percent residual variances for the leukaemia-data. A solid line is drawn for each response, and the mean residual variances are indicated by dashed lines. a) An optimised model with all variables included. The number of predictive components is four, but seven components are needed to explain all of the Y-variables. b) An optimised model with 900 variables. During the backward elimination, only the predictive components were included. Two of the four Y-variables are not explained by the model. c) An optimised model with 900 variables. During the backward elimination, all components needed to explain Y were included. This model explains all of the Y-variables. d) The CMV results are not biased by variable selection. The optimal number of components is Aopt = 2, which gives a residual variance of 60%.

A backward elimination of variables using A_mean components in each step is performed until a model of size K_opt is reached. The residual errors of this model are given in Fig. 6b. It is seen that the responses not explained by the full model are modelled even worse here. This is because the variables important for these responses are consistently neglected in the jack-knife. If the model ranks on the other hand are found according to Recommendation 2, all responses are included in the significance analysis. This results in the model in Fig. 6c. It is seen that all responses are well explained after 5 components, and the optimal model rank A_mean = 7 gives a residual variance of 20% for this model. It is important to realise, however, that Fig. 6b and c show only the cross-validated errors after variable selection and are thus optimistically biased. To obtain a residual error estimate which is unbiased by variable selection, the MSECMV in Fig. 6d must be used. The optimal number of components for future predictions is found conservatively by visual inspection. A model rank of A_opt = 2 gives a validated residual error of 60%, which is 11% better than for a two-component, full-size model. The plot in Fig. 6d differs from the previous plots in that the solid lines denote the original responses instead of the orthogonalised ones. Specifically, the best explained groups are the Control group followed by the MT-group, whereas the groups including MP are hardly or not explained.

It seems that the differences in genetic responses to MP, MPL and MPH are very small. There are not enough biological replicates available to model these treatments properly due to the low signal-to-noise ratio. We therefore attempt to pool the three MP-treatments into a single group of 38 samples. This model has an optimal size K_opt = 500 and rank A_opt = 3, and the predictive ability is high for all L = 3 groups. The cross model validated residual variance is 34%, as shown in Fig. 7. The classification correctness based on Ŷ^CMV for the groups Control, MT and pooled MP is 98%, 91% and 76%, respectively. The purity, or the fraction of class-assignments which is correct, is 94%, 100% and 88% for the three groups, respectively. For these calculations, a sample with a predicted value larger than 0.5 is assigned to the relevant class. The score plot in Fig. 8 reveals three clearly separated groups in two components.

Fig. 8. A score plot for the calibration model with three classes based on Kopt = 500 variables. The validated, explained variance is 61% for the two first PLSR components.

While the model with 5 groups was unable to separate between the MP-treatments, the pooled MP-group is well modelled using less than 4% of the original variables. These 500 variables could be investigated further to understand the differences in genetic responses to chemotherapy with MP versus MT. The more minute differences between chemical agents in combination could not be detected due to a low signal-to-noise ratio combined with very few biological replicates.

4. Discussion

As PLSR operates in a latent variable space, the risk that important variables are falsely removed in the backward elimination is small. The PLSR regression coefficients are calculated using a low-dimensional approximation of X, which means that no variables are overlooked in the jack-knife due to correlations and inversion problems. Backward elimination in combination with PLSR is therefore especially appealing. However, backward elimination is only one of several methods which attempt to find a good set of variables without performing an exhaustive search through all possible combinations. Other algorithms may exist which perform similarly or better, and the backward elimination may be replaced or supplemented for instance by a forward search. However, any such alternatives should be considered with speed of computation and ease of implementation in mind. Also, non-linear search algorithms highly dependent on parameter settings should be used with caution due to an associated risk of overfit.

As the previous example indicates, there is an increased need for many replicates when the signal-to-noise ratio is low. There are many situations, however, for which enough replicates are hard to obtain. One example is disease surveys with expression microarrays. First, the number of patients with the right diagnosis is in many cases limited. The patients must be willing to participate, they must fit a specific medical profile, and their age, sex, race, etc. should be in accordance with the experimental objective. In other cases, budget constraints may limit the number of arrays tested. There are also strict ethical guidelines to follow when performing for instance painful animal experiments. Such a limited availability of samples strongly speaks in favour of CMV over a separate validation set. Removal of many samples for a validation set will drastically decrease the predictive ability whenever few replicates are available. Utilisation of these samples in the training will improve the predictive ability of the model, and CMV will still provide an honest estimate of the residual error. If few samples are available, a validation-set residual error will also be imprecise and very dependent on the subset of samples selected for testing. As all samples are included for testing in CMV, the residual error estimate will in such cases be more precise.

In this work, the CMV results are used to find the optimal values for rank A_opt and size K_opt in the calibration model. As these parameters are not validated further, we cannot tell with absolute certainty that we do not introduce bias when we set these values. However, as is illustrated with a familiar example in Section 2.4, this is a risk commonly accepted when A_opt is found.

More research is needed, for instance by permutation testing and using several data sets, to confirm whether the bias introduced by K_opt is similarly negligible. To reduce the risk of selection bias due to A_opt and K_opt, these parameters must be conservatively chosen. For the estimation of model rank, this is assured by favouring less complex models if the predictive ability is similar. For the selection of model size, the shape of the curve for mean(MSEP^o) in the area around K_opt is an indication of robustness. A sharp spike cannot be trusted, whereas a smooth, flat area around the minimum means that the exact value of K_opt is not critical for the predictive performance of the model. An example of this is seen in Fig. 5, where any model size between 600 and 1200 variables gives similar residual errors for the model with 5 treatment groups. The analyst is then free to reduce the risk of overfit by including more variables in the model. Conversely, a stronger objective may be to find a small set of genes for closer biological studies. The mean(MSEP^o) can then be used to make a conscious choice of model size which retains an acceptable predictive ability.

It is argued by some that cross-validation will often be overly optimistic because a good fit to the training data does not guarantee predictability for future samples. This is a valid argument; however, it is a sampling problem rather than a validation problem. The validation can never correct for bad data. If your data do not sufficiently span the population on which inferences are made, your predictions will be biased no matter how you validate. Setting aside larger subsets of the data for validation will in this case not remove the bias, but produce a larger, more arbitrary bias. The increased bias will be detected in the validation, and the cruder calibration model will correctly be found to have a low predictive ability. Overfit is avoided by spanning the population better, not by removing more data from a set of samples which is already too small. Cross-validation is about making the best of the data you have. Performed correctly, it will provide better calibration models because both the training data and the validation data will span a larger subspace of the population of interest.

Even if the focus has been on expression data in this work, the methodology is useful for any data where the number of variables far exceeds the number of samples. Spectral data are interesting because the correlation between neighbouring wavelengths can provide a visual representation of the variable selection. An excellent application of CMV in regression modelling of spectral data is given in Ref. [19]. Another example is given for 2D-proteomic data in Ref. [21], where jack-knife on unfolded images is shown to highlight temporally varying proteomic spots. The increased precision provided by CMV compared to a separate validation set comes at the price of increased computer time. However, this is not discouraging with the computational power and speed available today; a full CMV analysis of the magnitude presented in this work is performed within 1–2 h on a modern laptop computer. Finally, many of the recommendations presented are also applicable to models on which no variable selection is performed. For instance, the modified uncertainty test should be considered whenever jack-knife is performed on models with several responses in Y.

5. Conclusions

A complete framework including the optimisation, validation and calibration stages of CMV is presented for bilinear regression models. It is emphasised that any variable selection must be validated, and CMV is shown to be an effective tool both for optimisation and validation when the number of objects is limited. Several recommendations are suggested to improve the outcome of the analysis. The new methodology is validated on an expression data set which is hard to analyse due to a very low signal-to-noise ratio and few biological replicates.

Acknowledgements

We gratefully acknowledge the Norwegian Microarray Consortium and the Functional Genomics (FUGE) initiative of the Norwegian Research Council (NFR) for financial support to our group. The reviewers are thanked for their constructive and thorough feedback.

Appendix A. The modified uncertainty test

The uncertainty test is given in Ref. [16] for testing of bilinear loadings and is reformulated here for testing of regression coefficients. In this section, M and N denote the number of segments and samples, respectively, which are available in each step of the outer CMV. The significance score $T_i^2$ for variable i ∈ {1,2,…,P} is

$$T_i^2 = \mathbf{b}_i^T \mathbf{C}_i^{-1} \mathbf{b}_i \tag{A.1}$$

The vector $\mathbf{b}_i$ is of length L and contains the estimated true regression coefficients for this variable. It can be found as the mean or median over cross-validation segments, or it may be recalculated based on all samples available in the current step. The variance–covariance matrix $\mathbf{C}_i$ is given as

$$\mathbf{C}_i = (1 - \beta)\,\mathbf{C}_i^{\mathrm{specific}} + \beta\,\mathbf{C}^{\mathrm{total}} \tag{A.2}$$

$$\mathbf{C}_i^{\mathrm{specific}} = \big(\mathbf{B}_i^{\mathrm{CV}} - \mathbf{1}\mathbf{b}_i^T\big)^T \big(\mathbf{B}_i^{\mathrm{CV}} - \mathbf{1}\mathbf{b}_i^T\big)\, g \tag{A.3}$$

The covariance estimate $\mathbf{C}^{\mathrm{total}}$ is based on all variables and may be given as the mean or median across the variable-specific covariances $\mathbf{C}_i^{\mathrm{specific}}$. The shrinkage coefficient β ∈ [0,1] controls the degree to which the total covariance is allowed to influence the specific covariance: β = 0 corresponds to no shrinkage, whereas full shrinkage is performed for β = 1. The matrix $\mathbf{B}_i^{\mathrm{CV}}$ is of size M by L and holds the estimated regression coefficients from the cross-validation. The scaling $g = \frac{M-1}{M}$ ensures that $T_i^2$ corresponds to the t-score from the standard uncertainty test [14] when M = N, L = 1 and β = 0. When M < N,

$$g = \frac{N - G}{(N/G)\,G} = \frac{N - G}{N} \tag{A.4}$$

where G denotes the number of left-out samples in each step, so that g reduces to (M − 1)/M for segments of equal size.

References

[1] E.R. Malinowski, Factor Analysis in Chemistry, Wiley, New York, USA, 2002.
[2] M.H. Cheok, W. Yang, C.-H. Pui, J.R. Downing, C. Cheng, C.W. Naeve, M.V. Relling, W.E. Evans, Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells, Nat. Genet. 34 (2003) 85–90.
[3] D.B. Allison, X. Cui, G.P. Page, M. Sabripour, Microarray data analysis: from disarray to consolidation and consensus, Nat. Rev. Genet. 7 (2006) 55–65.
[4] T. Speed (Ed.), Statistical Analysis of Gene Expression Microarray Data, Interdisciplinary Statistics, CRC Press, Boca Raton, USA, 2003.
[5] A. Tropsha, P. Gramatica, V.K. Gombar, The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models, QSAR Comb. Sci. 22 (2003) 69–77.
[6] E. Anderssen, K. Dyrstad, F. Westad, H. Martens, Reducing over-optimism in variable selection by cross-model validation, Chemometr. Intell. Lab. Syst. 84 (2006) 69–74.
[7] J.J. Kraker, D.M. Hawkins, S.C. Basak, R. Natarajan, D. Mills, Quantitative structure–activity relationship (QSAR) modeling of juvenile hormone activity: comparison of validation procedures, Chemometr. Intell. Lab. Syst. 87 (2007) 33–42.
[8] C. Ambroise, G.J. McLachlan, Selection bias in gene extraction on the basis of gene-expression data, Proc. Natl. Acad. Sci. U. S. A. 99 (2002) 6562–6566.
[9] J.S.U. Hjorth, Computer Intensive Statistical Methods: Validation, Model Selection and Bootstrap, Chapman and Hall, London, UK, 1994.
[10] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. Roy. Stat. Soc. B Met. 36 (1974) 111–147.
[11] A.J. Hardy, P. MacLaurin, S.J. Haswell, S. de Jong, B.G.M. Vandeginste, Double-case diagnostic for outliers identification, Chemometr. Intell. Lab. Syst. 34 (1996) 117–129.
[12] H. Martens, M. Martens, Multivariate Analysis of Quality: An Introduction, Wiley, Chichester, UK, 2001.
[13] A. Höskuldsson, PLS regression methods, J. Chemom. 2 (1988) 211–228.
[14] H. Martens, M. Martens, Modified jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR), Food Qual. Prefer. 11 (2000) 5–16.
[15] H.A. Martens, P. Dardenne, Validation and verification of regression in small data sets, Chemometr. Intell. Lab. Syst. 44 (1998) 99–121.
[16] L. Gidskehaug, E. Anderssen, A. Flatberg, B.K. Alsberg, A framework for significance analysis of gene expression data using dimension reduction methods, BMC Bioinformatics 8 (2007) 346.
[17] J. Trygg, S. Wold, O2-PLS, a two-block (X–Y) latent variable regression (LVR) method with an integral OSC filter, J. Chemom. 17 (2003) 53–64.
[18] L. Gidskehaug, E. Anderssen, B.K. Alsberg, Cross model validated feature selection based on gene clusters, Chemometr. Intell. Lab. Syst. 84 (2006) 172–176.
[19] F. Westad, N.K. Afseth, R. Bro, Finding relevant spectral regions between spectroscopic techniques by use of cross model validation and partial least squares regression, Anal. Chim. Acta 595 (2007) 323–327.
[20] National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/.
[21] E.M. Færgestad, M. Rye, B. Walczak, L. Gidskehaug, J.P. Wold, H. Grove, X. Jia, K. Hollung, U.G. Indahl, F. Westad, F. van den Berg, H. Martens, Pixel-based analysis of multiple images for the identification of changes: a novel approach applied to unravel proteome patterns of 2-D electrophoresis gel images, Proteomics 7 (2007) 3450–3461.
