Computational Statistics and Data Analysis 56 (2012) 1016–1027. doi:10.1016/j.csda.2011.05.024
Journal homepage: www.elsevier.com/locate/csda
Uncertainty estimation with a finite dataset in the assessment of classification models

Weijie Chen a, Waleed A. Yousef b, Brandon D. Gallas a, Elizabeth R. Hsu a, Samir Lababidi c, Rong Tang c, Gene A. Pennello c, W. Fraser Symmans d, Lajos Pusztai d

a Division of Imaging and Applied Mathematics, Office of Science and Engineering Laboratories, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, MD, USA
b Computer Science Department, Faculty of Computers and Information, Helwan University, Egypt
c Division of Biostatistics, Office of Surveillance and Biometrics, Center for Devices and Radiological Health, Food and Drug Administration, Silver Spring, MD, USA
d Departments of Breast Medical Oncology and Pathology, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA
Article history: Available online 16 June 2011
Keywords: Uncertainty; Training variability; Microarray classification; Area under the ROC curve (AUC)
Abstract

To successfully translate genomic classifiers to clinical practice, it is essential to obtain reliable and reproducible measurement of the classifier performance. A point estimate of the classifier performance must be accompanied by a measure of its uncertainty. In general, this uncertainty arises from both the finite size of the training set and the finite size of the testing set. The training variability is a measure of classifier stability and is particularly important when the training sample size is small. Methods have been developed for estimating such variability for the performance metric AUC (area under the ROC curve) under two paradigms: a smoothed cross-validation paradigm and an independent validation paradigm. The methodology is demonstrated on three clinical microarray datasets in the microarray quality control consortium phase two project (MAQC-II): breast cancer, multiple myeloma, and neuroblastoma. The results show that the classifier performance is associated with large variability and the estimated performance may change dramatically on different datasets. Moreover, the training variability is found to be of the same order as the testing variability for the datasets and models considered. In conclusion, the feasibility of quantifying both training and testing variability of classifier performance is demonstrated on finite real-world datasets. The large variability of the performance estimates shows that patient sample size is still the bottleneck of the microarray problem and that the training variability is not negligible.
1. Introduction

Classification models are commonly employed in DNA microarray analysis to combine multiple gene expression measurements into an index to predict clinical endpoints, for example, to discriminate between responders and nonresponders to a particular therapy or to discriminate between the presence and absence of a specified disease state, among many other applications (see, e.g., Golub et al., 1999; van 't Veer et al., 2002, and many others). This classification problem is particularly challenging because a typical microarray dataset consists of a large number (p) of genes and only a small
to moderate number (n) of samples, i.e., n ≪ p. To successfully translate genomic classifiers to clinical practice, it is essential to obtain reliable and reproducible measurement of the classifier performance in this n ≪ p setting, in terms of some metric, e.g., AUC (area under the receiver operating characteristic (ROC) curve) or error rate.

There are two kinds of errors associated with classification performance estimation: bias and variance. A common source of optimistic bias in microarray analysis is the repeated use of the limited dataset for multiple parameter estimation tasks, sometimes called resubstitution, leading to overfitting bias. For example, if a dataset is first used for feature selection and subsequently partitioned (by cross-validation or split sample) for classifier training and testing, the performance is optimistically biased, as demonstrated by Ambroise and McLachlan (2002) on real-world datasets and Simon et al. (2003) on simulated datasets. Proper use of the available dataset for classifier performance assessment to avoid overfitting bias should be the baseline of the "best practice" in microarray classification problems. Even given an unbiased estimate of the classifier performance, the estimate suffers from another kind of error: the variance. A point estimate of the classifier performance must be accompanied by a measure of its variability, which we use to draw confidence intervals and assess the statistical significance of differences between models. Variability (or standard error) estimation is also an essential step in methodologies proposed in the literature for sizing a microarray study (see, e.g., Mukherjee et al., 2003).

Microarray classifier development and validation consist of multiple steps: data normalization, feature selection, model selection, classifier parameter estimation, and classifier testing, to name a few. Each step may contribute to the variability of the classifier performance when that step is treated as a random process. We consider these steps to be random processes when they are a function of the data, which they often are. Non-random processes for these steps strictly appeal to biology, chemistry, and engineering. However, the most prevalent practice is to use only the testing sample to calculate the variance, which implicitly assumes that all the other training procedures (i.e., parameter estimation with a finite dataset) are fixed. Such a practice may lead to unstable findings in the n ≪ p setting, i.e., results that are found to be significant by considering only the testing variance may not be significant when the training set varies. For example, Michiels et al. (2005) studied the stability of the molecular signature and the classification accuracies of seven published microarray studies, each of which was based on a fixed training set. Michiels et al. (2005) reanalyzed the seven studies by randomly resampling multiple training sets and found that "Five of the seven studies did not classify patients better than chance". Another well-known example is the exchange between Dave et al. and Tibshirani. Dave et al. (2004) developed a model for predicting patient survival from gene expression data, and their model had a highly significant p value in an independent test set. Tibshirani (2005) reanalyzed their data by swapping the training and test sets, found that the results were no longer significant, and concluded that Dave et al.'s result is extremely fragile.
All these examples indicate that the stability (or fragility) of a classifier is a fundamentally important property, especially in the n ≪ p setting. Therefore, we propose accounting for both training and testing in the assessment of the standard deviation (SD) of estimated classifier performance, which accompanies the classification accuracy in reports or is used to perform statistical significance tests.

Although one can always use a Monte Carlo approach to assess the variability (of a performance estimator) in simulation studies, it is not trivial to estimate it with a single dataset in real-world applications. This is particularly true when one attempts to account for the training variability in addition to the testing variability. Despite the popularity of cross-validation (CV) methods, Bengio and Grandvalet (2004) have shown the difficulty of estimating the variability of CV-based performance estimators. In an effort to improve CV, Efron and Tibshirani (1997) proposed bootstrap-based error rate estimators including the leave-one-out (LOO) bootstrap and the .632+ estimator. Efron and Tibshirani (1997) introduced the influence function approach to estimate the variability (accounting for both training and testing) of the LOO bootstrap estimator of error rate and pointed out that "it is more difficult to obtain standard error estimates for cross-validation or the .632+ estimator". Recently, Jiang and Simon (2007) proposed and studied the variability of a repeated LOO bootstrap estimator with simulations; however, they did not provide a means to estimate its variability with a single dataset. Yousef et al. (2005) extended Efron's influence function approach to estimate the uncertainty of classifier performance in terms of AUC in a leave-pair-out bootstrap paradigm, which we call Paradigm One (see Section 2.2.1). Yousef et al. (2006) further investigated a U-statistics based approach to estimate the uncertainty of classifier performance in terms of AUC in a "two datasets—one for training and one for testing" paradigm, which we call Paradigm Two (see Section 2.2.2).

We advocate using the AUC as a classification performance metric because of its desirable properties, including independence of disease prevalence, independence of model calibration, and independence of a decision threshold (Metz, 1978). Specifying a threshold requires knowledge of disease prevalence and of the utilities of each decision/truth combination, and should be done at a later stage (Green and Swets, 1966; Metz, 1978), where an intended use (screening vs. diagnostic) and an intended population (general population vs. high risk) are specified. AUC provides a meaningful summary measure of the separability of the distributions of the positive-class scores and the negative-class scores and is a widely used performance metric in classification problems, especially in diagnostic medicine (Bamber, 1975; Metz, 1978; Hanley and McNeil, 1982; Bradley, 1997; Pepe, 2003; Fawcett, 2006; Hanczar et al., 2010).

In this paper, we present our classifier assessment strategies that account for both training and testing in the estimation of the uncertainty of AUC, using the techniques developed by Yousef et al. (2005, 2006). We demonstrate the applications of our methods on the clinical microarray datasets in the microarray quality control consortium phase two project (MAQC-II) (Shi et al., 2010).
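As a point of reference (our illustration, not code from the paper), the nonparametric AUC of a set of classifier scores is simply the Wilcoxon–Mann–Whitney statistic, which anticipates the kernel used in Section 2.2.1. A minimal sketch:

```python
import numpy as np

def empirical_auc(neg_scores, pos_scores):
    """Nonparametric AUC: fraction of (negative, positive) score pairs that are
    correctly ordered, counting ties as 1/2 (Wilcoxon-Mann-Whitney statistic)."""
    neg = np.asarray(neg_scores)[:, None]   # shape (N0, 1)
    pos = np.asarray(pos_scores)[None, :]   # shape (1, N1)
    kernel = (neg < pos) + 0.5 * (neg == pos)
    return kernel.mean()

# Toy example with two overlapping score distributions.
rng = np.random.default_rng(0)
auc = empirical_auc(rng.normal(0.0, 1.0, 50), rng.normal(1.0, 1.0, 50))
print(round(auc, 3))  # roughly 0.76 for unit-variance classes one SD apart
```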
Fig. 1. Illustrative example showing the relationships among conditional performance, mean performance, and infinite training performance as the size of the training set varies. For every training set size (on the x-axes), only one instance of the conditional performance (*) is shown for a particular training dataset. The mean is calculated over many instances of the conditional performance (*) from many training sets.
2. Methods

2.1. General considerations

We begin with some general considerations and definitions of some fundamental quantities in the assessment of classifiers. The classification performance (e.g., in terms of AUC) of a classifier trained with a training set and tested over the population is called the conditional performance; it is conditional on a particular finite-size training set and would vary when a different training set is used to train the classifier. The variability due to the random choice of a finite-size training set is called training variability. The expectation of the conditional performance over the population of training sets of the same size is called the mean performance. The mean performance is then a function of the training set size but is not conditional on a particular training set. As the training size approaches infinity, the training variability asymptotically vanishes and the mean performance asymptotically converges to the infinite training performance. Fig. 1 illustrates what the relationships among these quantities may look like. We use the mean performance as the reference for our calculations of performance bias: we take the difference between the expectation of the performance estimator and the mean performance at a given training sample size.

It is well known that classifier performance is overestimated by training and testing the classifier on the same dataset. To avoid such resubstitution bias given a finite dataset, cross-validation (CV) approaches are commonly used, such as leave-one-out CV or K-fold CV. However, these conventional CV methods have some drawbacks. First, the low bias of the performance estimate is often associated with high variance (Efron and Tibshirani, 1997). Second, variance estimation using the test scores for each subject in CV only accounts for the finite size of the testing set. Third, variance estimation using the sample variance of K performance estimates in K-fold CV cannot be unbiased because of the correlations among the training samples, as shown theoretically in Bengio and Grandvalet (2004). Variance estimation techniques are usually validated with simulation studies where the true variance can be obtained through the Monte Carlo method. We have reported simulation studies elsewhere (Chen et al., 2009) where we showed that the estimated variance of the classifier performance in leave-one-out CV is biased low (i.e., anti-conservative), especially in low intrinsic separation, large dimensionality, and low sample size settings.

Our approaches aim to estimate the mean performance of a classifier. We treat the finite-size training set as a random effect, including it with the random effect of the finite size of the testing set in our variance estimates. This is a more general approach than the traditional approaches that treat the finite training set as a fixed effect. This general approach is particularly important when the sample size is small and the model is less likely to be stable. One can imagine that, if the training variability is large, one may obtain a high performance by training the classifier with a "lucky" finite training set that is coincidentally similar to the testing set. However, one may find that the performance cannot be reproduced later on when the classifier is retrained with a new training set. As such, the training variability is a measure of classifier stability and is therefore itself a quantity of interest. We expect a "good" classifier to be "stable" with varying training sets, i.e., to have small training variability.
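To make these definitions concrete, the following toy Monte Carlo sketch (our illustration with arbitrary simulation settings, not part of the paper) estimates the conditional AUC of a simple linear classifier for many random training sets at each training size; their average approximates the mean performance and their spread approximates the training variability, mirroring Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, DELTA = 10, 1.5          # feature dimension and class-mean separation (arbitrary)

def draw(n_per_class):
    """Draw a two-class Gaussian sample: class 0 at the origin, class 1 shifted by DELTA."""
    x0 = rng.normal(0.0, 1.0, (n_per_class, DIM))
    x1 = rng.normal(0.0, 1.0, (n_per_class, DIM)) + DELTA / np.sqrt(DIM)
    return x0, x1

def auc(neg, pos):
    return ((neg[:, None] < pos[None, :]) + 0.5 * (neg[:, None] == pos[None, :])).mean()

# A large fixed test set stands in for "the population".
t0, t1 = draw(1000)

for n_train in (10, 25, 50, 100, 200):
    cond_auc = []
    for _ in range(100):                         # many independent training sets
        x0, x1 = draw(n_train)
        w = x1.mean(axis=0) - x0.mean(axis=0)    # simple plug-in linear classifier
        cond_auc.append(auc(t0 @ w, t1 @ w))     # conditional performance of this classifier
    cond_auc = np.array(cond_auc)
    print(f"n={n_train:4d}  mean AUC={cond_auc.mean():.3f}  training SD={cond_auc.std():.3f}")
```

As the training size grows, the printed mean performance climbs toward the infinite training performance and the training SD shrinks, which is the behavior sketched in Fig. 1.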
2.2. Uncertainty estimation techniques

The assessment of a classifier typically happens in one of the two following paradigms. Paradigm One has only one dataset available, and therefore some form of cross-validation, or more generally some resampling technique, is needed to train and
test the classifier. Paradigm Two has two datasets available, a "split sample", one for training and one for testing the classifier. We outline the techniques for estimating the mean performance of a classifier, in terms of AUC, and the training and testing uncertainty associated with the estimated performance of the classifier in these two scenarios.

2.2.1. Paradigm One: resampling approach

When resampling from a single dataset is the only available option (e.g., when the dataset is small and any split is not practical), we propose using a smoothed version of cross-validation for assessing a classifier that is based on the work of Efron and Tibshirani (1997). They proposed the leave-one-out bootstrap method for the performance metric error rate, and their technique was extended by Yousef et al. (2005) to the performance metric AUC. This method uses a leave-pair-out bootstrap (LPOB) approach to estimate the mean AUC and an influence function method (i.e., "delta method after bootstrap") to estimate the training and testing variability of the estimated mean AUC. The use of "leave-pair-out" rather than "leave-one-out" is because the nonparametric Wilcoxon–Mann–Whitney kernel for AUC is a two-sample statistic. Denote the number of actually-positive subjects as N1, the number of actually-negative subjects as N0, and the number of bootstrap training
sets as B. The LPOB estimator of the AUC, denoted $\widehat{AUC}^{(1,1)}$, is defined as

$$\widehat{AUC}^{(1,1)} = \frac{1}{N_0 N_1} \sum_{i=1}^{N_0} \sum_{j=1}^{N_1} \hat{\Psi}_{i,j}^{(1,1)},$$

where $\hat{\Psi}_{i,j}^{(1,1)}$ is the averaged Wilcoxon–Mann–Whitney kernel for a subject pair $(i, j)$ with feature vectors $x_{0i}$ and $x_{1j}$ respectively, averaging over all the training sets that do not contain this pair of subjects, i.e.,

$$\hat{\Psi}_{i,j}^{(1,1)} = \sum_{b=1}^{B} I_i^{b} I_j^{b}\, \Psi\!\left(\hat{h}_{tr^{*b}}(x_{0i}),\, \hat{h}_{tr^{*b}}(x_{1j})\right) \Big/ \sum_{b'=1}^{B} I_i^{b'} I_j^{b'},$$

where $\hat{h}_{tr^{*b}}(x_{0i})$ denotes the output of the classifier that is trained with a (bootstrap) training set $tr^{*b}$ and tested on an actually-negative subject $i$, $I_i^{b}$ is one if subject $i$ is not contained in $tr^{*b}$ and zero otherwise, $\hat{h}_{tr^{*b}}(x_{1j})$ and $I_j^{b}$ are defined similarly for an actually-positive subject $j$, and $\Psi$ is defined as

$$\Psi(a, b) \equiv \begin{cases} 1 & a < b, \\ 0.5 & a = b, \\ 0 & a > b. \end{cases}$$
In the LPOB approach, multiple (e.g., 5000) training sets are obtained by bootstrap resampling and each training set is used to train the classifier. In testing, any pair of subjects (one from the positive class and one from the negative class) is tested on the classifiers that do not contain these two subjects in their bootstrap training set. The Wilcoxon–Mann–Whitney statistic of the testing results of a pair of subjects is averaged over bootstrap training sets and is used to estimate the mean AUC. A unique advantage of this technique is that, by averaging over training sets, the classification performance is a smooth function of the subject samples. The smoothness property allows for estimating the variability of the AUC estimator using the "delta method after bootstrap". This is realized with a construct called the influence function, which is effectively a derivative of the performance metric in the direction of each sample. In addition, the influence function method can be applied to estimate the variance of the difference between two AUCs that correspond to two classifiers applied to the same set of patients. This allows for assessment of the statistical significance of the difference between two classifiers.
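A compact sketch of the LPOB point estimate follows (our simplified illustration on toy data, not the authors' implementation; the classifier, sample sizes, and B are placeholders):

```python
import numpy as np

def lpob_auc(x0, x1, train_and_score, B=1000, rng=None):
    """Leave-pair-out bootstrap estimate of the mean AUC.

    x0, x1: feature arrays for actually-negative / actually-positive subjects.
    train_and_score: function(train0, train1) -> scoring function for new samples.
    Assumes B is large enough that every (i, j) pair is left out at least once.
    """
    rng = rng or np.random.default_rng(0)
    n0, n1 = len(x0), len(x1)
    num = np.zeros((n0, n1))   # summed WMW kernel over qualifying bootstrap sets
    den = np.zeros((n0, n1))   # number of qualifying bootstrap sets per pair
    for _ in range(B):
        idx0 = rng.integers(0, n0, n0)          # bootstrap the two classes separately
        idx1 = rng.integers(0, n1, n1)
        score = train_and_score(x0[idx0], x1[idx1])
        s0, s1 = score(x0), score(x1)
        out0 = ~np.isin(np.arange(n0), idx0)    # subjects left out of this bootstrap set
        out1 = ~np.isin(np.arange(n1), idx1)
        mask = np.outer(out0, out1)             # pairs (i, j) with both subjects left out
        kernel = (s0[:, None] < s1[None, :]) + 0.5 * (s0[:, None] == s1[None, :])
        num += mask * kernel
        den += mask
    psi = num / np.maximum(den, 1)              # averaged kernel per pair
    return psi.mean()

# Usage with a simple mean-difference linear scorer on toy Gaussian data.
rng = np.random.default_rng(2)
x0 = rng.normal(0, 1, (30, 10)); x1 = rng.normal(0.4, 1, (30, 10))
linear = lambda t0, t1: (lambda x: x @ (t1.mean(axis=0) - t0.mean(axis=0)))
print(round(lpob_auc(x0, x1, linear, B=300, rng=rng), 3))
```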
Mathematically, the variance of $\widehat{AUC}^{(1,1)}$ is estimated by the "delta method after bootstrap" and can be written as

$$\widehat{\mathrm{Var}}\, \widehat{AUC}^{(1,1)} = \frac{1}{N_0^2} \sum_{i=1}^{N_0} \hat{D}_{0i}^2 + \frac{1}{N_1^2} \sum_{j=1}^{N_1} \hat{D}_{1j}^2,$$

where $(\hat{D}_{01}, \ldots, \hat{D}_{0N_0}, \hat{D}_{11}, \ldots, \hat{D}_{1N_1})$ is the empirical influence function (EIF) of $\widehat{AUC}^{(1,1)}$. For details of the calculation of the EIF, see Yousef et al. (2005).

We point out that the variance assessment approach above can be extended to assess the variance of the difference between two LPOB AUC estimators that correspond to two classifiers applied to the same set of patients. Denote the EIFs for two LPOB AUC estimators $\widehat{AUC}_1^{(1,1)}$ and $\widehat{AUC}_2^{(1,1)}$ as $\hat{D}^{(1)} \equiv (\hat{D}_{01}^{(1)}, \ldots, \hat{D}_{0N_0}^{(1)}, \hat{D}_{11}^{(1)}, \ldots, \hat{D}_{1N_1}^{(1)})$ and $\hat{D}^{(2)} \equiv (\hat{D}_{01}^{(2)}, \ldots, \hat{D}_{0N_0}^{(2)}, \hat{D}_{11}^{(2)}, \ldots, \hat{D}_{1N_1}^{(2)})$ respectively. Then the EIF for $\hat{\Delta} = \widehat{AUC}_1^{(1,1)} - \widehat{AUC}_2^{(1,1)}$ is $\hat{D}^{(1)} - \hat{D}^{(2)}$. The variance of $\hat{\Delta}$ is

$$\widehat{\mathrm{Var}}\, \hat{\Delta} = \frac{1}{N_0^2} \sum_{i=1}^{N_0} \left(\hat{D}_{0i}^{(1)} - \hat{D}_{0i}^{(2)}\right)^2 + \frac{1}{N_1^2} \sum_{j=1}^{N_1} \left(\hat{D}_{1j}^{(1)} - \hat{D}_{1j}^{(2)}\right)^2.$$

Since $\hat{\Delta}$ is asymptotically normal, one can assess the significance of the difference using the z score test with the test statistic $z = \hat{\Delta} / \sqrt{\widehat{\mathrm{Var}}\, \hat{\Delta}}$.
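Given the two EIF vectors, the variance of the difference and the z statistic reduce to a few lines (a sketch with hypothetical EIF arrays d0_1, d1_1, d0_2, d1_2; computing the EIFs themselves follows Yousef et al. (2005)):

```python
import numpy as np
from scipy.stats import norm

def auc_difference_test(d0_1, d1_1, d0_2, d1_2, delta):
    """z test for the difference of two LPOB AUCs from their empirical influence functions.

    d0_k, d1_k: EIF components of classifier k over the N0 negative / N1 positive subjects.
    delta: the observed difference of the two LPOB AUC estimates.
    """
    n0, n1 = len(d0_1), len(d1_1)
    var_delta = (np.sum((d0_1 - d0_2) ** 2) / n0**2
                 + np.sum((d1_1 - d1_2) ** 2) / n1**2)
    z = delta / np.sqrt(var_delta)
    return z, 2 * norm.sf(abs(z))   # z statistic and two-sided p value
```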
Fig. 2. Diagrams for our three-stage assessment study design: data mining, pilot study, and pivotal study. (a) Paradigm One: a single dataset is used for the pilot study; note that this dataset should be independent of the data-mining dataset. (b) Paradigm Two: two datasets are used for the pilot study; the data-mining dataset can be reused for training the classifier. In either paradigm, both training and testing are accounted for in the estimate of the variance of the classifier performance. In the pivotal study stage, the training set can include the datasets in the first two stages while the test set must be independent.
2.2.2. Paradigm Two: split sample

In Paradigm Two, two datasets are available: one for training the classifier and one for testing the trained classifier to estimate the mean classifier performance. We use a technique developed by Yousef et al. (2006) to estimate the variability of the performance estimate that accounts for both training and testing. By treating both the training and the testing as random effects, we decompose the variance of the estimated performance as
$$\mathrm{Var}_{tr,ts}\, \widehat{AUC} = E_{tr}\!\left[\mathrm{Var}_{ts}\, \widehat{AUC}\right] + \mathrm{Var}_{tr}\!\left[E_{ts}\, \widehat{AUC}\right],$$

where $\widehat{AUC}$ is the estimate of the AUC, "Var" designates variance, "E" designates expectation, "tr" designates training, and "ts" designates testing. In the equation above, the variability of the estimated AUC includes both training and testing (i.e., the random choice of a finite training sample and the random choice of a finite testing sample). The first term is the expectation (over multiple training sets) of the testing variability and can be interpreted as a measure of the (average) testing variance. The second term is the variance (over multiple training sets) of the conditional performance and is a measure of the training variance or classifier stability. In the commonly used approach where the training is treated as a fixed effect, only the quantity in the bracket of the first term is estimated and reported; this quantity can be estimated using the methods developed by Bamber (1975) or DeLong et al. (1988). Yousef et al. (2006) derived a U-statistics-based approach for estimating the training and testing variance components and their sum as indicated in the equation above. For more details, see Yousef et al. (2006).

2.3. Study designs

The term "training" can generally include many data analysis procedures that involve a finite dataset for parameter estimation, such as feature selection, classifier parameter estimation, etc. In principle, the techniques described above can be used to quantify the variability caused by all the general training procedures by repeating them with multiple training sets (e.g., bootstrap). However, here we consider a three-stage study design as illustrated in Fig. 2. At the first stage, a dataset is used for data mining, e.g., feature selection and model selection. Other non-automatic procedures such as incorporation of prior biological knowledge can also be involved in this stage. Stage two is a "pilot study", the purpose of which is to assess the performance of the selected classifier and the selected features, i.e., estimating the performance and its uncertainty. At this stage, a prospective plan should outline the features and the classifier architecture that were determined in the first stage, and the classifier should be tested on patient samples that were never involved in the first stage. Any deviation from the prospective plan (e.g., tweaking parameters or dropping/adding features by looking at and trying to "improve" the test performance) is strictly prohibited in this stage to avoid multiple selection bias. This step serves as "an interim look" at the classifier performance and, if promising, the classifier is further validated in a large pivotal study (stage three). In this design, many preprocessing procedures are included in the data-mining stage, the output of which is then assumed to be fixed. The "training" variability as assessed in the pilot-study stage, therefore, only refers to the AUC variability caused by the randomness of the finite training set that is used for estimating the classifier parameters. This design at least has the advantage of reducing the computational load, since repeating the data-mining procedure multiple times may be very time-consuming. This design also allows for some ad hoc procedures in the data-mining stage while maintaining the rigor of classifier assessment in the pilot and pivotal stages. We consider two paradigms in the pilot-study stage.
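Before describing the two paradigms, a toy Monte Carlo sketch (our illustration with arbitrary settings, not part of the MAQC-II analysis) shows how the training and testing variance components defined in Section 2.2.2 add up by the law of total variance:

```python
import numpy as np

rng = np.random.default_rng(3)
DIM, N_TRAIN, N_TEST = 10, 40, 40      # arbitrary toy sizes

def draw(n):
    x0 = rng.normal(0, 1, (n, DIM))
    x1 = rng.normal(0, 1, (n, DIM)) + 1.5 / np.sqrt(DIM)
    return x0, x1

def auc(neg, pos):
    return ((neg[:, None] < pos[None, :]) + 0.5 * (neg[:, None] == pos[None, :])).mean()

auc_hat = np.empty((100, 100))                   # 100 training sets x 100 test sets
for t in range(100):
    x0, x1 = draw(N_TRAIN)
    w = x1.mean(axis=0) - x0.mean(axis=0)        # plug-in linear classifier
    for s in range(100):
        t0, t1 = draw(N_TEST)
        auc_hat[t, s] = auc(t0 @ w, t1 @ w)

total_var = auc_hat.var()
testing_component = auc_hat.var(axis=1).mean()   # E_tr[Var_ts AUC]
training_component = auc_hat.mean(axis=1).var()  # Var_tr[E_ts AUC]
print(total_var, testing_component + training_component)  # the two quantities agree
```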
Paradigm One (Fig. 2(a)) is that a single dataset (M subjects) is available to assess a classifier using the LPOB and influence function method to estimate the mean performance and the uncertainty in that estimate. Note that this dataset should be completely independent of the data-mining dataset to avoid resubstitution bias, since every subject in this dataset is used to test the classifier. When the size (M) of the pilot dataset is extremely small, however, the results may be substantially (conservatively) biased due to the small training size, which is
effectively M/2 from bootstrapping (Efron and Tibshirani, 1997). Another practical difficulty is failure to converge for some classifiers (e.g., logistic regression) due to the small sample size. An alternative paradigm for the pilot study, Paradigm Two (Fig. 2(b)), is to use one dataset to train the classifier and then test the classifier on a new dataset of M subjects to estimate the classifier performance. The training set can possibly be the dataset used in the data-mining stage. In this paradigm, we use the bootstrap and the U-statistics based technique to assess the variance of the estimated performance. Ideally, three independent datasets are needed for data mining, classifier training, and classifier testing, respectively. The reuse of the N subjects for both feature selection and classifier training may be a practically useful compromise due to limited patient samples (yet the N subjects are completely independent of the test set of M subjects).

If the classifier performance is promising in the pilot study, it will then be further evaluated in a pivotal study. The "traditional hygiene" for a pivotal study is to freeze the classifier at the outset and collect independent data for testing. Note that the training set can include the datasets used in the first two stages. While it might be satisfactory to assess the conditional performance given a sufficiently large training sample size, it may still be desirable to perform an uncertainty analysis using the techniques in Paradigm Two to obtain a measure of the classifier stability.

3. Results

We conducted two case studies with clinical datasets of MAQC-II to demonstrate the application of our approach to the assessment of classification models in microarray analysis.

3.1. Uncertainty estimates of a previously reported breast cancer chemotherapy response predictor

In the study by Hess et al. (2006), a genomic predictor was constructed to predict whether a breast cancer patient would have pathologic complete response (pCR) or residual disease (RD) after preoperative chemotherapy. A dataset consisting of 133 subjects was sequentially partitioned into a data-mining set (82 subjects) and a validation dataset (51 subjects). With the data-mining dataset, they ranked the genes with a t-test method and selected the top 31 genes and diagonal linear discriminant analysis (DLDA) as their best classifier. Then they tested the performance of the DLDA classifier on the remaining (independent) 51 subjects. In the current MAQC-II analysis, we used 130 subjects of this previously published breast cancer dataset (3 subjects did not pass the quality control process of MAQC-II). This corresponds to a data-mining set of 81 subjects and a dataset of 49 subjects for a pilot assessment study. Furthermore, MAQC-II released another new dataset of 100 subjects for validation purposes. Patient characteristics in the three distinct breast cancer datasets are presented in Table 1. With these available datasets and the features selected by Hess et al. (2006) using the 81 subjects, we conducted the following three pilot assessment studies.

(1) Following Paradigm One, we applied the leave-pair-out bootstrap (LPOB) to the independent 49 subjects and assessed the uncertainty with the influence function method. This represented a pilot study to assess the DLDA model and features that were selected during Hess's data mining using the 81 subjects.
Furthermore, we combined the original 49 validation subjects with the new 100 validation subjects and performed a pilot assessment study on the total of 149 subjects with the LPOB approach.

(2) Following Paradigm Two, we considered the 81 data-mining subjects as the pilot-study training set and the independent 49 subjects as the pilot-study test set. We repeated the study by using the 81 subjects as the training set and the combined 149 subjects as the test set. Furthermore, with the training set of 81 subjects and the test set of 49 subjects, we compared three classifiers, DLDA, LDA, and quadratic discriminant analysis (QDA), in terms of their mean performance, testing variability, and training variability.

(3) Following Paradigm Two, instead of reusing the 81 subjects for both data mining and pilot-study classifier training, we used the 49 subjects as the training set and the 100 subjects as the test set. We also repeated this assessment by swapping the training and testing sets.

The results of these assessment studies are shown in Figs. 3–6. Figs. 3–5 plot, for various combinations of training and test sets, the estimated mean performances against the number N (N = 2, ..., 30) of top features as ranked by Hess et al. (2006). The error bars represent two standard deviations (SD) above and below the mean performance, accounting for training and testing. Fig. 6 plots the mean performance, the testing variability, and the training variability for three classifiers: DLDA, LDA, and QDA. We summarize our findings as follows.

First, each estimated performance is associated with large uncertainty (Figs. 3–5). Furthermore, since Fig. 6 shows that the training variability is negligible for the DLDA classifier, most of the uncertainty seen in Figs. 3–5 is from testing. Next, Fig. 6(a) shows that classifiers with increased complexity are associated with decreased performance and with increased testing and training variability (Fig. 6(b) and (c)). This quantitatively demonstrates that simpler classifiers are advantageous compared with more complex ones in low sample size settings. Finally, for each classifier, an increase in the number of features may increase both the separation ability of the classifier and the complexity of the problem, and hence there is a trade-off. It is observed that the optimal number of features in this problem is 16, 5, and 3 for DLDA, LDA, and QDA respectively (with optimality here defined as the maximum AUC and minimum (testing) variance). This means that, for more complex classifiers, as the number of features increases, the increased separation ability is more quickly counteracted by the increased complexity.
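A hedged sketch of this type of comparison (our illustration on synthetic data, not the MAQC-II code; DLDA is implemented directly since scikit-learn has no diagonal-LDA class, and the regularization and sample sizes are placeholders):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
P = 500                                                 # toy "gene" count
Xtr = rng.normal(0, 1, (80, P)); ytr = rng.integers(0, 2, 80)
Xte = rng.normal(0, 1, (50, P)); yte = rng.integers(0, 2, 50)
Xtr[ytr == 1, :20] += 0.8; Xte[yte == 1, :20] += 0.8    # only the first 20 features carry signal

# Rank features by a two-sample t test on the training set only.
t, _ = ttest_ind(Xtr[ytr == 1], Xtr[ytr == 0])
order = np.argsort(-np.abs(t))

def dlda_scores(x_tr, y_tr, x_te):
    """Diagonal LDA: pooled per-feature variances, linear score along the mean difference."""
    m0, m1 = x_tr[y_tr == 0].mean(axis=0), x_tr[y_tr == 1].mean(axis=0)
    v = np.concatenate([x_tr[y_tr == 0] - m0, x_tr[y_tr == 1] - m1]).var(axis=0)
    return x_te @ ((m1 - m0) / v)

for k in (3, 5, 16, 30):
    cols = order[:k]
    aucs = {"DLDA": roc_auc_score(yte, dlda_scores(Xtr[:, cols], ytr, Xte[:, cols])),
            "LDA": roc_auc_score(yte, LinearDiscriminantAnalysis()
                                 .fit(Xtr[:, cols], ytr).decision_function(Xte[:, cols])),
            "QDA": roc_auc_score(yte, QuadraticDiscriminantAnalysis(reg_param=0.1)  # light regularization for larger k
                                 .fit(Xtr[:, cols], ytr).predict_proba(Xte[:, cols])[:, 1])}
    print(k, {m: round(a, 2) for m, a in aucs.items()})
```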
Table 1
Clinical information and demographics of the MDACC breast cancer patients.

|                                      | N = 81 | N = 49 | N = 100 |
|--------------------------------------|--------|--------|---------|
| Platform                             | Affymetrix U133A | Affymetrix U133A | Affymetrix U133A |
| Specimens                            | Fine needle aspiration | Fine needle aspiration | Fine needle aspiration |
| Female (n)                           | 81 (100%) | 49 (100%) | 100 (100%) |
| Median age                           | 52 years | 50 years | 50 years |
| Race                                 |        |        |         |
|   Caucasian                          | 55 (68%) | 30 (59%) | 68 (68%) |
|   African American                   | 11 (14%) | 2 (6%) | 12 (12%) |
|   Asian                              | 7 (9%) | 2 (4%) | 7 (7%) |
|   Hispanic                           | 6 (7%) | 15 (31%) | 13 (13%) |
|   Mixed                              | 2 (2%) | 0 | 0 |
| Cancer histology                     |        |        |         |
|   Invasive ductal (IDC)              | 72 (89%) | 47 (96%) | 85 (85%) |
|   Mixed ductal/lobular (IDC/ILC)     | 6 (7%) | 2 (4%) | 8 (8%) |
|   Invasive lobular (ILC)             | 1 (1%) | 0 | 7 (7%) |
|   Others                             | 2 (2%) | 0 | 0 |
| Tumor size                           |        |        |         |
|   T0                                 | 0 | 1 (2%) | 2 (2%) |
|   T1                                 | 7 (9%) | 5 (10%) | 8 (8%) |
|   T2                                 | 46 (57%) | 24 (49%) | 62 (62%) |
|   T3                                 | 14 (17%) | 7 (14%) | 13 (13%) |
|   T4                                 | 14 (17%) | 12 (24%) | 15 (15%) |
| Lymph node stage                     |        |        |         |
|   N0                                 | 27 (33%) | 12 (24%) | 27 (27%) |
|   N1                                 | 38 (47%) | 23 (47%) | 47 (47%) |
|   N2                                 | 8 (10%) | 6 (12%) | 13 (13%) |
|   N3                                 | 8 (10%) | 8 (16%) | 13 (13%) |
| Nuclear grade (BMN)                  |        |        |         |
|   1                                  | 2 (2%) | 0 | 11 (11%) |
|   2                                  | 29 (36%) | 23 (47%) | 42 (42%) |
|   3                                  | 50 (62%) | 26 (53%) | 47 (47%) |
| Estrogen receptor positive (1)       | 46 (57%) | 34 (69%) | 60 (60%) |
| Estrogen receptor negative           | 35 (43%) | 15 (31%) | 40 (40%) |
| HER-2 positive (2)                   | 25 (31%) | 8 (17%) | 7 (7%) |
| HER-2 negative                       | 56 (69%) | 40 (83%) | 93 (93%) |
| Neoadjuvant therapy (3)              |        |        |         |
|   Weekly T × 12 + FAC × 4            | 68 (84%) | 44 (90%) | 98 (98%) |
|   3-weekly T × 4 + FAC × 4           | 13 (16%) | 5 (10%) | 2 (2%) |
| Pathologic complete response (pCR)   | 21 (26%) | 12 (24%) | 15 (15%) |
| Residual disease (RD)                | 60 (74%) | 37 (76%) | 85 (85%) |

(1) Cases where >10% of tumor cells stained positive for ER with immunohistochemistry (IHC) were considered positive.
(2) Cases that showed either 3+ IHC staining or had gene copy number > 2.0 were considered HER-2 "positive".
(3) T = paclitaxel; FAC = 5-fluorouracil, doxorubicin, and cyclophosphamide.
It should be noted that each variance estimated with the finite data is associated with an estimation error, which explains the noise observed in the estimated variance curves in Fig. 6(b) and (c). However, the standard error of the estimated variance of the classifier performance is not analytically available and is usually only computed in Monte Carlo simulations for validation purposes.

Second, the estimated mean performance substantially decreased when assessed on the larger dataset compared to the smaller dataset. Fig. 3 shows that, when the mean performance is assessed on a single dataset, the estimated mean performance substantially decreased when the dataset expanded from 49 subjects to 149 subjects. Fig. 4 shows that, when the training set (81 subjects) is fixed and the test set is expanded from 49 subjects to 149 subjects, the mean performance substantially decreased. Fig. 5 shows that, when the training and test sets were swapped, the mean performances were substantially different. A larger training set of N = 100 yielded better performance than the smaller training set of 49 subjects. These results indicate that the training set sizes that were used here might be too small to generate optimal predictors with maximum stability and generalizability for our classification problem.

In all these results, the genomic predictors show promise in the sense that their predictions of the pCR/RD status of patients appear to be significantly better than a random guess when both training and testing are taken into account as random effects. However, the large variability and the fact that the estimated performance changes dramatically on different datasets suggest that (1) the datasets differ in the distribution of patient characteristics that correlate with clinical endpoints, and (2) the learning curves for the classifiers do not plateau at the sample size of the smaller dataset. It seems that the dataset of 49 subjects is made up of easier samples than the dataset of 100 subjects.
Fig. 3. Mean performance and uncertainty of DLDA as estimated using LPO bootstrap and influence function approach, with a dataset consisting of 49 subjects and also estimated with a dataset consisting of 149 subjects. Features were ranked with an independent data-mining dataset (81 subjects) by Hess et al. (2006).
Fig. 4. Mean performance and uncertainty of DLDA as estimated using bootstrap and U-statistics approach with two datasets: training set of 81 subjects and testing set of 49 subjects; also training set of 81 subjects and testing set of 149 subjects.
Indeed, the two subsets differ substantially with regard to the proportion of HER-2 positive patients and the nuclear grade of the cancers. The N = 100 set has fewer patients with HER-2 positive cancer and a larger proportion of low grade tumors than either the training set of N = 81 or the first test set of N = 49 (Table 1). This is relevant because HER-2 positive cancers and high grade tumors represent a more chemotherapy-sensitive subset of breast cancers (Andre et al., 2008). As such, a pivotal trial should identify and prospectively sample from an intended use population and perhaps design the trial for subgroup analysis.

3.2. Predictor development and assessment for the breast cancer, multiple myeloma and neuroblastoma datasets

We also developed our own genomic classification models to predict the pCR/RD status for the breast cancer dataset (BR) and overall survival (OS) milestone endpoints for the other two clinical datasets: multiple myeloma (MM) (Shaughnessy et al., 2007) and neuroblastoma (NB) (Oberthuer et al., 2006). The OS milestone endpoints were obtained by applying a critical threshold to the overall survival time of the patients (i.e., defining OS > 24 months as negative for MM and OS > 30 months as negative for NB).

In the data-mining stage, we explored multiple combinations of methods for data normalization, feature selection, and classification. BR and MM data were normalized by the commonly used Microarray Suite (MAS) 5.0 algorithm and also by a novel approach called reference set robust multi-array average (RMA) (refSetRMA). The refSetRMA method, as an extension of the RMA method (Irizarry et al., 2003), sets aside a reference set of arrays and normalizes any other array by applying the RMA procedure to the reference set plus that array. An advantage of using refSetRMA is that all samples used in the training and testing sets are normalized to the same reference set of arrays, thereby reducing or possibly eliminating any batch-effect biases.
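The "fixed reference set" idea can be illustrated with a simple quantile-normalization analog (our sketch only; the actual refSetRMA operates on probe-level data through the RMA pipeline, e.g., in Bioconductor, which is not reproduced here):

```python
import numpy as np

def reference_quantile_profile(reference_arrays):
    """Mean sorted-intensity profile of a fixed reference set (arrays x genes)."""
    return np.sort(reference_arrays, axis=1).mean(axis=0)

def normalize_to_reference(new_array, ref_profile):
    """Map the ranks of a new array onto the reference quantile profile, so every
    array (training or testing) is normalized against the same fixed reference."""
    ranks = np.argsort(np.argsort(new_array))
    return ref_profile[ranks]

# Toy usage: 20 reference arrays with 1000 "genes", then one new array with a batch shift.
rng = np.random.default_rng(5)
ref = rng.lognormal(5, 1, (20, 1000))
profile = reference_quantile_profile(ref)
new = rng.lognormal(5.5, 1, 1000)          # shifted batch
print(np.median(new), np.median(normalize_to_reference(new, profile)))  # shift largely removed
```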
Fig. 5. Mean performance and uncertainty of DLDA as estimated using bootstrap and U-statistics approach with two datasets: training set of 100 subjects and testing set of 49 subjects; also training set of 49 subjects and testing set of 100 subjects.
Fig. 6. Mean performance (a), measure of testing variability (b), and measure of training variability or classifier stability (c), for three classifiers with increasing complexity: DLDA, LDA, QDA. Assessed with a training set of 81 subjects and a testing set of 49 subjects.
For NB, the standard two-color normalization method was used, where the geometric mean over the ratios of samples was taken as the reference for the two pairs of dye-swapped arrays, and the ratios were based on the background-subtracted, linear Loess-normalized intensity data.

We had two tracks of methods for feature selection. The first track consisted of two statistical filters (fold change plus p value volcano plot, and family-wise error rate or a more stringent p value) and two biologically based feature expansion methods (correlation with important clinical covariates, and gene ontology using GOMiner). The second track consisted of two statistical filters, a threshold for the area under the ROC curve adjusted for time-to-event information (Heagerty et al., 2000) when available, and LASSO L1 penalization to shrink logistic regression coefficients (http://cran.r-project.org/web/packages/penalized/index.html).
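The study used the R 'penalized' package; an analogous L1-penalized logistic regression feature filter in Python might look like this (a sketch with a placeholder penalty strength, not the study's settings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def lasso_feature_filter(X, y, C=0.1):
    """Keep the genes whose L1-penalized logistic regression coefficients are nonzero.

    C is the inverse penalty strength (placeholder value); smaller C keeps fewer genes.
    """
    Xs = StandardScaler().fit_transform(X)      # put genes on a comparable scale
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    model.fit(Xs, y)
    return np.flatnonzero(model.coef_[0])       # indices of the selected genes

# Toy usage: 100 samples x 2000 genes, signal in the first 10 genes.
rng = np.random.default_rng(6)
X = rng.normal(0, 1, (100, 2000)); y = rng.integers(0, 2, 100)
X[y == 1, :10] += 1.0
print(lasso_feature_filter(X, y))               # mostly indices below 10
```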
Table 2
Summary of data mining and pilot-study assessment results. For all the three datasets, the logistic regression model was selected as the candidate classifier.

| Dataset                            | BR | MM | NB |
|------------------------------------|----|----|----|
| Data mining                        |    |    |    |
|   Data partition                   | 5-fold CV | Split sample | Split sample |
|   Normalization                    | refSetRMA | MAS 5.0 | MAS 5.0 |
|   Training size                    | 4/5 × 115 | 227 | 188 |
|   Testing size                     | 1/5 × 115 | 113 | 50 |
|   Number of features               | 15 | 20 | 11 |
|   Cond. AUC (testing SD)           | 0.80 (0.08) | 0.76 (0.07) | 0.90 (0.06) |
|   Final model: training set size   | 115 (27+, 88−) | 340 (51+, 289−) | 238 (22+, 216−) |
|   Final model: # features          | 25 | 53 | 13 |
| Pilot study                        |    |    |    |
|   Testing set size                 | 100 (15+, 85−) | 214 (27+, 187−) | 177 (39+, 138−) |
|   Mean AUC                         | 0.73 | 0.62 | 0.84 |
|   Approximate 95% CI               | (0.53, 0.93) | (0.44, 0.70) | (0.74, 0.94) |
|   SD_{Tr,Ts}                       | 0.10 | 0.07 | 0.05 |
|   Training SD: SD_{Tr}             | 0.06 | 0.01 | 0.03 |
|   Average testing SD: SD_{Ts}      | 0.08 | 0.06 | 0.04 |
Three classifiers were evaluated: logistic regression, DLDA, and diagonal quadratic discriminant analysis (DQDA). The whole process (including feature selection and classifier training) was internally validated by split sample or 5-fold cross-validation. One process was selected for each dataset based on the internal validation performance in terms of AUC. The selected candidate processes in the data-mining stage are summarized in Table 2. With the selected candidate processes, the entire training set was used to select features and train the selected classifier, resulting in three final models, which were then assessed in the pilot-study stage (Table 2).

For the BR dataset, the best process consists of the refSetRMA normalization (15 subjects set aside for this task), the logistic regression classifier, and 15 genes from track-one feature selection. The best internal validation performance, a conditional AUC (SD) of 0.80 (0.08), was obtained in a five-fold CV of the 115 subjects. For the MM dataset, the best process consists of the MAS 5.0 normalization, the logistic regression classifier, and 20 genes from track-one feature selection. The best internal validation performance, a conditional AUC (SD) of 0.76 (0.07), was obtained in a split-sample paradigm (227 subjects for training and 113 subjects for testing). For the NB dataset, the best model consists of the standard two-color normalization, the logistic regression classifier, and 11 genes from track-two feature selection. The best internal validation performance, a conditional AUC (SD) of 0.90 (0.06), was obtained in a split-sample paradigm (188 subjects for training and 50 subjects for testing). The final classification model was obtained based on the best process using the entire dataset as the training set.

For each of the above three clinical studies, even though the best model in the data-mining stage and the final model to be assessed in the pilot study were obtained from the same process (normalization, feature selection, and classification method), they differed in that the feature selection and classifier training were performed on different datasets. Moreover, the best model was selected from multiple models and its performance may be subject to a selection bias. Therefore, the performance of the best model estimated in the data-mining stage only served as a criterion for model (features and classifier) selection rather than as a claim of the classifier performance that is generalizable to the population (as such, the SD was the traditional testing SD calculated using the methods of DeLong et al. (1988)). The generalizable performance of these selected models should be assessed with an independent dataset, which we do in the pilot study.

In the pilot-study stage, the final model and features were assessed with an independent dataset that was collected as part of the MAQC-II project. We employed Paradigm Two for the assessment of the three final models: we reused the data-mining dataset as the pilot-study training dataset, and the pilot-study testing set was the MAQC-collected validation dataset. The sizes of the testing sets, as well as the estimated performance and the associated uncertainties, are summarized in Table 2. The approximate 95% CI of the estimated mean AUC was calculated as $\widehat{AUC} \pm 1.96\, SD_{Tr,Ts}$. We can make a few important observations from these results. First, the classification performance estimated in the pilot study is lower than that in the data-mining stage for all three studies.
This indicates that internal validation alone (in the data-mining stage) is not sufficient and that it is important to perform a pilot assessment with an independent dataset. Second, the genomic classifiers show promise in discriminating between clinical endpoints. Third, the uncertainty of the performance estimate is large. The 95% CI of the AUC for the breast cancer study, (0.53, 0.93), covers 80% of all possible AUC values (which range from 0.5 to 1.0 for a non-trivial classifier). The 95% CI of the AUC for the multiple myeloma study contains the random-guess AUC value (0.5). Fourth, the training variability is comparable to the testing variability, which indicates that the models are not stable with varying training sets at the current training sample size. Ignoring the training variability would lead to an over-optimistic conclusion (i.e., underestimating the true variance of the performance estimate).

4. Discussion

Although not explicitly emphasized in the popular literature, the concept of "training variability" is not totally new. For example, it is often stated that "leave-one-out cross-validation has large variance". This statement implies that, by
independently drawing samples (of the same size) from the population and applying the LOO-CV estimator, the estimated performance values are highly variable. One should note that both the training set and the testing set vary in this process and hence both contribute to the variance of the performance. However, it is difficult to estimate the variance that accounts for both training and testing with a single dataset for the commonly used CV estimators (Bengio and Grandvalet, 2004; Efron and Tibshirani, 1997). A major contribution of this paper is to demonstrate the feasibility of quantifying both training and testing variability of classifier performance on finite real-world datasets.

It is worthwhile to discuss the bias–variance properties of various validation strategies: the conventional cross-validation strategies, the leave-pair-out bootstrap, and independent validation with a split sample. First of all, the optimistic resubstitution bias is avoided in all these strategies when they are properly used. However, different strategies may differ in another type of bias, which we call the learning curve bias, i.e., (under proper conditions) the smaller the training size, the lower the mean performance (see Fig. 1). For example, for a dataset of size N, LOO-CV is nearly unbiased since it has a training size of N − 1, whereas the half-half split-sample strategy would be biased low since it has a training size of N/2. The bootstrap sample of a training set of size N has an effective training size close to N/2 (Efron and Tibshirani, 1997), so the bootstrap based AUC estimators we used are biased low if the learning curve does not plateau. However, they have less variance because they average over multiple (bootstrap) training sets. More importantly, the variance can actually be estimated with a single dataset. LOO-CV and .632+ bootstrap estimators are nearly unbiased; however, these estimators are often associated with larger and difficult-to-estimate variance. In clinical applications, it is important to correctly estimate the variance of a performance estimator to draw meaningful conclusions.

The bias–variance trade-off can be summarized by the mean squared error (MSE). This trade-off is affected by the sample size n, the dimensionality p, and also the problem complexity (i.e., the intrinsic separability between the two populations, or the classification difficulty). In the n ≫ p and high intrinsic separation (easy problem or low-hanging fruit) settings, different validation methods tend to converge. In the n ≪ p and low separation settings, it is generally accepted that variance dominates the MSE and substantial bias can thus be tolerated if it is accompanied by variance reduction. Comparison of different validation strategies in terms of bias, variance, and MSE can be achieved by means of Monte Carlo simulation studies (see, e.g., Chen et al., 2009; Kim, 2009; Hanczar et al., 2010). Our simulations (Chen et al., 2009) show that, in the n ≪ p and low separation settings, the LPOB AUC estimator may gain an advantage in terms of MSE compared to the LOO-CV AUC estimator. In another study (Popovici et al., 2010) comparing several feature selection and classification models on the breast cancer dataset, it was also observed that LPOB estimates were closer to the validation performance than were the cross-validation estimates, although no statistically significant differences among a variety of models were found due to the finite sample size.
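As a quick back-of-the-envelope check of the "effective training size close to N/2" statement (our arithmetic, not reproduced from Efron and Tibshirani, 1997): a bootstrap sample of size N contains, on average,

$$N\left[1 - \left(1 - \frac{1}{N}\right)^{N}\right] \approx N\left(1 - e^{-1}\right) \approx 0.632\,N$$

distinct subjects; this reduced support is the basis for the effective training size of roughly N/2 (or M/2 in Section 2.3) cited above from Efron and Tibshirani (1997).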
Although our uncertainty estimation techniques have been validated in simulation studies (Yousef et al., 2005, 2006; Chen et al., 2009), this is the first demonstration of their application to real-world problems. A unique advantage of our methods is that they account for both training and testing in the uncertainty estimation. This is important because, although the problem of classifier stability has been recognized (Michiels et al., 2005; Tibshirani, 2005), rigorous methods are rarely used in real-world applications to quantify the classifier stability with respect to varying training sets.

Statistical inference of a population parameter (e.g., classifier performance on the population) is based on parameter estimation using a sample of subjects. The validity of the inference requires a fundamental assumption that the sample is randomly drawn from, and hence is representative of, that population. However, in clinical practice, patient samples are often collected "by convenience" rather than by prospective randomization. This fact has two important implications for research studies using such patient samples. First, one should be cautious about resampling (by CV or bootstrap) a dataset that is combined from several datasets, which may have been collected under different conditions. Randomly resampling the combined datasets would lose the opportunity to detect potential systematic differences among them. For example, our Paradigm Two analysis in Section 3.1 showed that the newly collected sample of 100 breast cancer patients may differ systematically from the previously collected sample of 130 patients, and such a difference would not be found if all the patients were pooled and randomly partitioned. Second, for a promising genomic classifier, it is important to identify the intended patient population and validate the classifier with an independent dataset that is collected with prospective randomization.

We have restricted the demonstration of our uncertainty estimation techniques to a situation where a set of features had already been selected, and therefore the analyses were conditional on a fixed set of features. In principle, our methods can be extended to include the feature selection as part of the training. For example, in the LPOB technique, each bootstrap training set can be used first for feature selection and then for classifier training with the selected features. By doing this, the meaning of "classifier training" includes not only estimation of the weight parameters of the classifier but also feature selection. The variability is expected to be larger as it further includes the variability due to the variation of the selected features.

5. Conclusions

We applied our uncertainty estimation methods to three clinical microarray datasets: breast cancer, multiple myeloma, and neuroblastoma. Our results show that the classifier performance is associated with large variability and that the estimated performance may change dramatically on different datasets. Moreover, the training variability is found to be of the same order as the testing variability for the datasets and models considered. In conclusion, we demonstrate the feasibility of quantifying both training and testing variability of classifier performance on finite real-world datasets. The large variability of the performance estimates shows that patient sample size is still the bottleneck of the microarray problem and that the training variability is not negligible.
Acknowledgments

We gratefully dedicate this work in memory of Dr. Robert F. Wagner, who enthusiastically promoted the concept of "training variability" and inspired many of us until he passed away unexpectedly in June 2008. The authors gratefully thank Dr. André Oberthuer from the University of Cologne, Germany, for providing the neuroblastoma dataset, and Dr. John Shaughnessy Jr. from the University of Arkansas for Medical Sciences, USA, for providing the multiple myeloma dataset. Certain commercial materials and equipment are identified in order to adequately specify experimental procedures. In no case does such identification imply recommendation or endorsement by the FDA, nor does it imply that the items identified are necessarily the best available for the purpose.

References

Ambroise, C., McLachlan, G.J., 2002. Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 99, 6562–6566.
Andre, F., Mazouni, C., Liedtke, C., Kau, S., Frye, D., Green, M., Gonzalez-Angulo, A., Symmans, W., Hortobagyi, G., Pusztai, L., 2008. HER2 expression and efficacy of preoperative paclitaxel/FAC chemotherapy in breast cancer. Breast Cancer Res. Treat. 108, 183–190.
Bamber, D., 1975. The area above the ordinal dominance graph and the area below the receiver operating characteristic curve. J. Math. Psych. 12, 387–415.
Bengio, Y., Grandvalet, Y., 2004. No unbiased estimator of the variance of k-fold cross-validation. J. Mach. Learn. Res. 5, 1089–1105.
Bradley, A.P., 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog. 30, 1145.
Chen, W., Wagner, R.F., Yousef, W.A., Gallas, B.D., 2009. Comparison of classifier performance estimators: a simulation study. In: Medical Imaging 2009: Image Perception, Observer Performance, and Technology Assessment, SPIE, Lake Buena Vista, FL, USA, pp. 72630X–11.
Dave, S.S., Wright, G., Tan, B., Rosenwald, A., Gascoyne, R.D., Chan, W.C., Fisher, R.I., Braziel, R.M., Rimsza, L.M., Grogan, T.M., Miller, T.P., LeBlanc, M., Greiner, T.C., Weisenburger, D.D., Lynch, J.C., Vose, J., Armitage, J.O., Smeland, E.B., Kvaloy, S., Holte, H., Delabie, J., Connors, J.M., Lansdorp, P.M., Ouyang, Q., Lister, T.A., Davies, A.J., Norton, A.J., Muller-Hermelink, H.K., Ott, G., Campo, E., Montserrat, E., Wilson, W.H., Jaffe, E.S., Simon, R., Yang, L., Powell, J., Zhao, H., Goldschmidt, N., Chiorazzi, M., Staudt, L.M., 2004. Prediction of survival in follicular lymphoma based on molecular features of tumor-infiltrating immune cells. New England Journal of Medicine 351, 2159–2169.
DeLong, E.R., DeLong, D.M., Clarke-Pearson, D.L., 1988. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845.
Efron, B., Tibshirani, R., 1997. Improvements on cross-validation: the .632+ bootstrap method. J. Amer. Statist. Assoc. 92, 548–560.
Fawcett, T., 2006. An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537.
Green, D.M., Swets, J.A., 1966. Signal Detection Theory and Psychophysics. Wiley, New York (reprint, Krieger, New York, 1974).
Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M., Dougherty, E.R., 2010. Small-sample precision of ROC-related estimates. Bioinformatics 26, 822–830.
Hanley, J.A., McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36.
Heagerty, P.J., Lumley, T., Pepe, M.S., 2000. Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56, 337–344.
Hess, K.R., Anderson, K., Symmans, W.F., Valero, V., Ibrahim, N., Mejia, J.A., Booser, D., Theriault, R.L., Buzdar, A.U., Dempsey, P.J., Rouzier, R., Sneige, N., Ross, J.S., Vidaurre, T., Gomez, H.L., Hortobagyi, G.N., Pusztai, L., 2006. Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J. Clin. Oncol. 24, 4236–4244.
Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., Speed, T.P., 2003. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264.
Jiang, W., Simon, R., 2007. A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification. Stat. Med. 26, 5320–5334.
Kim, J.H., 2009. Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap. Comput. Statist. Data Anal. 53, 3735–3745.
MAQC-II Consortium (Shi, L., et al.), 2010. The MAQC-II project: a comprehensive study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. 28, 827–838.
Metz, C.E., 1978. Basic principles of ROC analysis. Seminars in Nuclear Medicine 8, 283–298.
Michiels, S., Koscielny, S., Hill, C., 2005. Prediction of cancer outcome with microarrays: a multiple random validation strategy. The Lancet 365, 488–492.
Mukherjee, S., Tamayo, P., Rogers, S., Rifkin, R., Engle, A., Campbell, C., Golub, T.R., Mesirov, J.P., 2003. Estimating dataset size requirements for classifying DNA microarray data. Journal of Computational Biology 10, 119–142.
Oberthuer, A., Berthold, F., Warnat, P., Hero, B., Kahlert, Y., Spitz, R., Ernestus, K., König, R., Haas, S., Eils, R., Schwab, M., Brors, B., Westermann, F., Fischer, M., 2006. Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. J. Clin. Oncol. 24, 5070–5078.
Pepe, M.S., 2003. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, Oxford, United Kingdom.
Popovici, V., Chen, W., Gallas, B., Hatzis, C., Shi, W., Samuelson, F., Nikolsky, Y., Tsyganova, M., Ishkin, A., Nikolskaya, T., Hess, K., Valero, V., Booser, D., Delorenzi, M., Hortobagyi, G., Shi, L., Symmans, W., Pusztai, L., 2010. Effect of training sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Res. 12, R5.
Shaughnessy, J., Zhan, F., Burington, B., Huang, Y., Hanamura, I., Stewart, J., Kordsmeier, B., Randolph, C., Williams, D., Xiao, Y., Xu, H., Epstein, J., Anaissie, E., Krishna, S., Cottler-Fox, M., Hollmig, K., Mohiuddin, A., Pineda-Roman, M., Tricot, G., van Rhee, F., Sawyer, J., Alsayed, Y., Walker, R., Zangari, M., Crowley, J., Barlogie, B., 2007. Blood 109, 2276–2284.
Simon, R., Radmacher, M.D., Dobbin, K., McShane, L.M., 2003. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl. Cancer Inst. 95, 14–18.
Tibshirani, R., 2005. Immune signatures in follicular lymphoma. N. Engl. J. Med. 352, 1496–1497; author reply 1496–1497.
van 't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., Schreiber, G.J., Kerkhoven, R.M., Roberts, C., Linsley, P.S., Bernards, R., Friend, S.H., 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536.
Yousef, W.A., Wagner, R.F., Loew, M.H., 2005. Estimating the uncertainty in the estimated mean area under the ROC curve of a classifier. Pattern Recogn. Lett. 26, 2600–2610.
Yousef, W.A., Wagner, R.F., Loew, M.H., 2006. Assessing classifiers from two independent data sets using ROC analysis: a nonparametric approach. IEEE Trans. Pattern Anal. Mach. Intell. 28, 1809–1817.