Estimating Cross-Validation Variability

Section on Statistics in Epidemiology – JSM 2009

Waleed A. Yousef∗ and Weijie Chen†

Abstract: We propose a novel influence-function-based approach to estimating the variance of classification performance that is estimated using the Monte Carlo K-fold Cross Validation (MKCV) procedure. In MKCV, the data is partitioned into K exclusive folds; the classifier is trained on K − 1 folds and tested on the remaining fold to estimate the performance. This process is repeated M times, shuffling the data each time, and the M performance values are averaged to obtain the final performance estimate. A naive variance estimator is the sample variance of the M performance values, which does not account for the correlations among repetitions and hence is biased. Our preliminary simulation results show that our new approach is promising.

Key Words: Classifiers, Cross-Validation, K-fold Cross Validation, Repeated Cross Validation, Variance Estimation, Assessment.

1. Introduction

Cross Validation (CV) is used extensively in statistical learning to estimate the performance of a classification rule. Although the CV approach can provide a nearly unbiased estimate of the performance, the estimation of its uncertainty is generally known to be difficult (Bengio and Grandvalet, 2004; Isaksson et al., 2008; Markatou et al., 2005). A naive estimate of the performance variance is the sample variance of the performance values obtained in the CV procedure. In Monte Carlo K-fold Cross Validation (MKCV), for example, the data is partitioned into K exclusive folds; the classifier is trained on K − 1 folds and tested on the remaining fold to estimate the performance. This process is repeated M times, shuffling the data each time, and the M performance values are averaged to obtain the final performance estimate. A naive approach is to use the sample variance of the M performance values as a variance estimator; this does not account for the correlations among repetitions and hence can be substantially biased. In this paper, we propose a novel influence-function-based approach to estimating the variance of classification performance estimated with the MKCV procedure. We summarize several CV variants in a uniform notation in Section 2.1 and provide their naive variance estimators in Section 2.2.

∗ Computer Science Department, Faculty of Computers and Information, Helwan University, Egypt.
† NIBIB/CDRH Joint Laboratory for the Assessment of Medical Imaging Systems, Division of Imaging and Applied Mathematics, OSEL, CDRH, FDA.


We then focus on MKCV and provide the theoretical basis for our new variance estimator of MKCV in Section 2.3. We present our preliminary simulation results in Section 3, followed by a discussion of future work in Section 4.

2. Methods

2.1 Which Version of Cross Validation?

The concepts of different variants of cross validation in the general parameter estimation problem were introduced by Burman (1989), and CV estimators are nowadays used extensively in the estimation of classification performance (Hastie, Tibshirani and Friedman, 2001; Duda, Hart and Stork, 2001; Bishop, 2006). We summarize several CV variants in the classification context in (1), using error rate as a performance measure:

$$\widehat{\mathrm{Err}}^{(KCV)} = \frac{1}{n}\sum_{i=1}^{n} Q\left(x_i,\, \mathbf{X}_{(\{K(i)\})}\right), \qquad (1a)$$

$$\widehat{\mathrm{Err}}^{(RKCV)} = \left[\sum_{m}\frac{1}{n}\sum_{i=1}^{n} Q\left(x_i,\, \mathbf{X}_{(\{K_m(i)\},m)}\right)\right]\Big/ M, \qquad (1b)$$

$$\widehat{\mathrm{Err}}^{(MKCV)} = \left[\sum_{m}\sum_{i} I_{im}\, Q\left(x_i,\, \mathbf{X}_{(\{1\},m)}\right)\right]\Big/\left[\sum_{m}\sum_{i} I_{im}\right], \qquad (1c)$$

where $\mathbf{X}$ is the whole dataset with $n$ observations, $\mathbf{X}_{(\cdot)}$ is the training dataset after excluding the testing folds indicated in brackets, $Q$ is the 0–1 loss function, $I_{im}$ indicates whether the $i$th observation belongs to the testing fold in the $m$th repetition, and $K_m(i)$ maps the $i$th observation in the $m$th repetition to a particular CV fold, with $M$ the total number of repetitions. The estimator (1a) is the conventional K-fold CV. The estimator (1b) is the so-called repeated K-fold CV, an average over repetitions of KCV, where each repetition is done after shuffling the data. The estimator (1c) is the so-called Monte Carlo CV, which is similar to RKCV except that in each repetition testing is done only on the first of the K partitions. We also defined the corresponding CV estimators for the Area Under the ROC Curve (AUC) in a similar fashion; those formulae are not provided because the focus here is the concept of CV rather than the performance metric.
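To make the estimators in (1) concrete, the following minimal sketch (ours, not the authors' code) computes all three; a nearest-class-mean classifier stands in for an arbitrary classification rule, and the toy Gaussian data and the values of n, K, and M are illustrative assumptions.

```python
# Sketch of the three CV estimators (1a)-(1c); all names here are ours.
import numpy as np

rng = np.random.default_rng(0)

def train(X, y):
    """Fit class means and return a predict function (stand-in classifier)."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda Z: (np.linalg.norm(Z - mu1, axis=1)
                      < np.linalg.norm(Z - mu0, axis=1)).astype(int)

def kcv_losses(X, y, K, rng):
    """One K-fold pass: per-observation 0-1 losses Q(x_i, X_({K(i)}))."""
    n = len(y)
    fold = rng.permutation(n) % K            # random fold assignment K(i)
    Q = np.empty(n)
    for k in range(K):
        test = fold == k
        f = train(X[~test], y[~test])        # train on the other K - 1 folds
        Q[test] = f(X[test]) != y[test]      # 0-1 loss on the held-out fold
    return Q, fold

n, p, K, M = 40, 2, 5, 100
X = np.vstack([rng.normal(0, 1, (n // 2, p)), rng.normal(1, 1, (n // 2, p))])
y = np.repeat([0, 1], n // 2)

err_kcv = kcv_losses(X, y, K, rng)[0].mean()          # (1a) conventional KCV

reps = [kcv_losses(X, y, K, rng) for _ in range(M)]   # M shuffled repetitions
err_rkcv = np.mean([Q.mean() for Q, _ in reps])       # (1b) repeated KCV

# (1c) Monte Carlo KCV: in each repetition only the first fold is tested
num = sum(Q[fold == 0].sum() for Q, fold in reps)
den = sum((fold == 0).sum() for _, fold in reps)
err_mkcv = num / den
print(err_kcv, err_rkcv, err_mkcv)
```

Note that (1c) normalizes by the total indicator mass rather than by n, since each repetition tests only one fold.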


Efron and Tibshirani (1997) alluded to the fact that a performance estimator based on cross validation is not a "smooth statistic" and that its variance is hard to estimate from a single dataset. Informally, the smoothness property means that a small variation in one observation should result in only a small variation in the statistic. They proposed a smooth estimator of the error rate, the leave-one-out (LOO) bootstrap, and further derived an influence-function (IF) based estimator of its variance. Yousef, Wagner and Loew (2005) followed the same route to derive an IF-based variance estimator for the leave-pair-out bootstrap estimator of the AUC. Readers seeking a more formal treatment may consult Hampel (1974, 1986) or Huber (2004). The KCV is not a smooth statistic, because a small change in one observation can produce an abrupt (0-to-1) change in the kernel Q. The other two versions, MKCV and RKCV, are smooth, since the testing of each observation is averaged over many training sets. This averaging is carried out by the summation over m (see (1b) and (1c)); it smoothes the quantity inside the square brackets and makes it suitable for differentiation.

2.2 Some Ad Hoc Methods of Estimating the Variance of KCV

A naive variance estimator for KCV, as analyzed by Bengio and Grandvalet (2004), is

$$\widehat{\mathrm{Var}}\left[\widehat{\mathrm{Err}}^{(KCV)}\right] = \frac{1}{K}\left[\frac{1}{K-1}\sum_{k=1}^{K}\left(\mathrm{err}_k - \frac{1}{K}\sum_{k=1}^{K}\mathrm{err}_k\right)^{2}\right], \qquad (2)$$

where $\mathrm{err}_k$ is the error rate observed from testing on the $k$th fold. This is one version of what many practitioners use to estimate the variance of the KCV. It has the same form as the well-known Uniformly Minimum Variance Unbiased (UMVU) estimator of the variance, but it is NOT the UMVUE here, because the $K$ values of $\mathrm{err}_k$ are not independent: every pair $\mathrm{err}_i, \mathrm{err}_j$, $i \neq j$, shares $K-2$ folds between the two training sets. It is worth mentioning that some authors, e.g., Hastie, Tibshirani and Friedman (2001), use (2) in selecting classifiers at the design stage based on their relative values, not to give an absolute estimate. In the case of RKCV, another ad hoc estimator is

$$\widehat{\mathrm{Var}}^{RKCV} = \frac{1}{R}\sum_{r=1}^{R}\widehat{\mathrm{Var}}_{r}^{KCV}, \qquad (3)$$

where $\widehat{\mathrm{Var}}_{r}^{KCV}$ is the estimator (2) computed in the $r$th repetition. For the MKCV, an ad hoc version is

$$\widehat{\mathrm{Var}}^{MKCV} = \frac{1}{M}\left[\frac{1}{M-1}\sum_{m=1}^{M}\left(\mathrm{err}_{1m} - \frac{1}{M}\sum_{m=1}^{M}\mathrm{err}_{1m}\right)^{2}\right], \qquad (4)$$

where $\mathrm{err}_{1m}$ is the error rate produced by testing on the first partition in repetition $m$. This version, again, assumes the testing folds are independent, which is not true.
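The naive estimators (2)–(4) can be sketched in the same toy setting; this continues the previous sketch (reusing its kcv_losses helper and the data X, y, rng) and reflects our reading of the formulas, not the authors' code.

```python
# Naive variance estimators (2)-(4); continues the earlier sketch.
import numpy as np

def naive_var_kcv(X, y, K, rng):
    """Eq. (2): sample variance of the K fold-wise errors, divided by K.
    Treats the err_k as independent, which they are not."""
    Q, fold = kcv_losses(X, y, K, rng)
    err_k = np.array([Q[fold == k].mean() for k in range(K)])
    return np.var(err_k, ddof=1) / K

def naive_var_rkcv(X, y, K, R, rng):
    """Eq. (3): average of the estimator (2) over R shuffled repetitions."""
    return np.mean([naive_var_kcv(X, y, K, rng) for _ in range(R)])

def naive_var_mkcv(X, y, K, M, rng):
    """Eq. (4): sample variance of the M first-fold errors, divided by M.
    Again assumes independent testing folds, hence downwardly biased."""
    err_1m = np.empty(M)
    for m in range(M):
        Q, fold = kcv_losses(X, y, K, rng)
        err_1m[m] = Q[fold == 0].mean()      # error on the first fold only
    return np.var(err_1m, ddof=1) / M

print(naive_var_kcv(X, y, 5, rng),
      naive_var_rkcv(X, y, 5, 100, rng),
      naive_var_mkcv(X, y, 5, 100, rng))
```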

2.3 Estimating CV Variance Using the Influence Function

Here we provide an introduction to the influence-function approach based on Hampel (1974), Efron and Tibshirani (1997), and Yousef, Wagner and Loew (2005), which serves


as the theoretical basis of our variance estimator for MKCV. Consider any real-valued functional $s$ defined on the distribution $F$, which is not necessarily discrete. Perturbing $F$ by adding probability mass at a point $x$ results in a new probability distribution $G_{\varepsilon,x}$ given by

$$G_{\varepsilon,x} = (1-\varepsilon)F + \varepsilon\,\delta_x, \qquad x \in \mathbb{R}. \qquad (5)$$

The influence curve (Hampel, 1974) $IC_{s,F}(\cdot)$ is defined as

$$IC_{s,F}(x) = \lim_{\varepsilon \to 0^{+}} \frac{s[(1-\varepsilon)F + \varepsilon\,\delta_x] - s(F)}{\varepsilon}. \qquad (6)$$

Assume that there is a distribution $G$ "near" the distribution $F$; then, under some regularity conditions (see, e.g., Huber, 1996, Ch. 2), a functional $s$ can be approximated as

$$s(G) \approx s(F) + \int IC_{s,F}(x)\,dG(x), \qquad (7)$$

where

$$\int IC_{s,F}(x)\,dF(x) = 0, \qquad (8)$$

and the asymptotic variance of $s(F)$ is given by

$$\mathrm{var}_F\big(s(F)\big) = \int \{IC_{s,F}(x)\}^{2}\,dF(x), \qquad (9)$$

which can be considered an approximation to the variance under a distribution $G$ near $F$. Now assume that $s$ is a functional statistic of the data set $\mathbf{x} = \{x_i : x_i \sim F,\ i = 1, 2, \ldots, n\}$. In that case the influence curve (6) is defined for each observation $x_i$, under the true distribution $F$, as

$$U_i(s, F) = \lim_{\varepsilon \to 0} \frac{s(F_{\varepsilon,i}) - s(F)}{\varepsilon} = \left.\frac{\partial s(F_{\varepsilon,i})}{\partial \varepsilon}\right|_{\varepsilon=0}, \qquad (10)$$

where $F_{\varepsilon,i}$ is the distribution under the perturbation at observation $x_i$. This quantity is called the influence function. If the distribution $F$ is not known, the MLE $\hat{F}$ of $F$ places mass

$$\hat{F}:\ \frac{1}{n}\ \text{on each}\ x_i \in \mathbf{x}, \qquad i = 1, 2, \ldots, n, \qquad (11)$$

and, as an approximation, $\hat{F}$ may be substituted for $F$ in (10); the result is the so-called "empirical influence function". The perturbation defined in (5) can then be rewritten as

$$\hat{F}_{\varepsilon,i} = (1-\varepsilon)\hat{F} + \varepsilon\,\delta_{x_i}, \qquad x_i \in \mathbf{x},\ i = 1, 2, \ldots, n; \qquad (12)$$

$$\hat{f}_{\varepsilon,i}(x_j) = \begin{cases} \frac{1-\varepsilon}{n} + \varepsilon, & j = i, \\ \frac{1-\varepsilon}{n}, & j \neq i. \end{cases} \qquad (13)$$
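As a numerical illustration of (10)–(13) (ours, not from the paper), the empirical influence values $U_i$ can be approximated by perturbing the mass of one observation as in (13) and taking a finite difference. For the sample mean the exact answer is $U_i = x_i - \bar{x}$, so the sketch can be checked against it.

```python
# Empirical influence function by finite differences; a check on (10)-(13).
import numpy as np

def weighted_mean(x, w):
    """The statistic s evaluated under a reweighted empirical distribution."""
    return np.sum(w * x)

def empirical_influence(x, stat, eps=1e-6):
    n = len(x)
    w0 = np.full(n, 1.0 / n)                 # F-hat: mass 1/n on each x_i, (11)
    s0 = stat(x, w0)
    U = np.empty(n)
    for i in range(n):
        w = (1 - eps) * w0                   # (13): (1 - eps)/n everywhere ...
        w[i] += eps                          # ... plus extra mass eps at x_i
        U[i] = (stat(x, w) - s0) / eps       # finite difference approximating (10)
    return U

x = np.random.default_rng(1).normal(size=20)
U = empirical_influence(x, weighted_mean)
print(np.allclose(U, x - x.mean(), atol=1e-4))   # True: matches the exact IF
```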


Substituting $\hat{F}$ for $G$ in (7) and combining the result with (10) gives

$$s(\hat{F}) \approx s(F) + \frac{1}{n}\sum_{i=1}^{n} U_i(s, F). \qquad (14)$$

Then the asymptotic variance expressed in (9) can be written for $s(F)$ as

$$\mathrm{var}_F(s) = \frac{1}{n}\,E_F\!\left[U^{2}(x_i, F)\right] \qquad (15)$$

$$\approx \frac{1}{n^{2}}\sum_{i=1}^{n} U_i^{2}(x_i, \hat{F}). \qquad (16)$$

Equation (16) can be used as a nonparametric estimate $\widehat{\mathrm{Var}}(s)$ of the variance of a statistic $s$ under the empirical distribution $\hat{F}$. Calculating $U_i$ for the MKCV estimator and substituting back into (16) gives an estimate of the variance of the CV performance estimate.
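The sketch below illustrates our reading of this recipe; it is not the authors' implementation. The MKCV error rate (1c) is written as a function of observation weights (equal weights 1/n recover (1c)), $U_i$ is approximated by the finite difference (10) with the fold assignments held fixed, and (16) is evaluated. With the 0–1 loss the training-side dependence is step-like, so this numerical stand-in mainly captures the testing-weight contribution; an analytic derivation of $U_i$, as the IF approach calls for, avoids this limitation.

```python
# IF-based variance estimate (16) for a weighted MKCV error rate (our sketch).
import numpy as np

rng = np.random.default_rng(2)
n, p, K, M = 40, 2, 5, 100
X = np.vstack([rng.normal(0, 1, (n // 2, p)), rng.normal(1, 1, (n // 2, p))])
y = np.repeat([0, 1], n // 2)
folds = [rng.permutation(n) % K for _ in range(M)]   # fixed fold assignments

def mkcv_err(w):
    """Weighted version of (1c) with a weighted nearest-class-mean classifier."""
    num = den = 0.0
    for fold in folds:
        test, tr = fold == 0, fold != 0
        mu = [np.average(X[tr & (y == c)], axis=0,
                         weights=w[tr & (y == c)]) for c in (0, 1)]
        pred = (np.linalg.norm(X[test] - mu[1], axis=1)
                < np.linalg.norm(X[test] - mu[0], axis=1)).astype(int)
        num += np.sum(w[test] * (pred != y[test]))
        den += np.sum(w[test])
    return num / den

eps, w0 = 1e-5, np.full(n, 1.0 / n)
s0 = mkcv_err(w0)
U = np.empty(n)
for i in range(n):
    w = (1 - eps) * w0                        # perturbation (12)-(13) at x_i
    w[i] += eps
    U[i] = (mkcv_err(w) - s0) / eps           # empirical influence, Eq. (10)

var_if = np.sum(U ** 2) / n ** 2              # Eq. (16)
print(s0, var_if)
```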

3. Simulation Study

Here we present our preliminary experiments to evaluate our new IF-based method of estimating the variance of the MKCV estimator. We used the Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) classifiers with training sizes of 10, 20, 40, and 60 per class, sampled from normal distributions in 2 and 4 dimensions. We sampled observations from two normal distributions with identity covariance matrices, one with a zero mean vector and the other with a mean vector whose elements all equal c, where c is adjusted for class separability. We tried different versions of CV, namely RKCV and MKCV. In all experiments we repeated the CV (whether RKCV or MKCV) 1000 times. Certainly, RKCV converges much faster than MKCV (typically 100 repetitions are adequate); we kept both numbers of repetitions equal for the sake of comparison. In general, RKCV and MKCV produce almost the same estimate of the AUC (as expected) with the same MC variance; we therefore show only MKCV results.

Regarding variance estimation, the IF-based method for MKCV is downwardly biased with small variance. For a given n, the bias decreases inversely with K; roughly, K = 5 gives a good compromise between the bias of the performance estimate and the bias of its variance estimate. We present some of our simulation results in Tables 1–2, which compare the IF-based estimator with the ad hoc estimator (4). From the tables it appears that the largest bias of the IF-based estimator occurs for the smallest sample size. The reason is related to the smoothness issue; e.g., for n = 10 and K = 5 the number of distinct training sets is only $\binom{10}{2} = 45$, which is challenging for the IF method. Our preliminary results also show that the ad hoc estimator (4) is downwardly biased. However, we noticed that if we scale the estimator by a factor M/K, i.e., normalize by K (rather than M) outside the brackets in (4), we obtain a much better estimator, which, in our preliminary simulations, yields variance estimates that are close to the true variance and comparable to our IF method (data not shown); one way of writing this rescaled estimator is given below. More theoretical and experimental investigations are needed to evaluate this empirical finding.
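For concreteness, the rescaling just described replaces the factor 1/M outside the brackets in (4) by 1/K; under that reading the modified estimator is

$$\widehat{\mathrm{Var}}^{MKCV}_{\mathrm{scaled}} = \frac{M}{K}\,\widehat{\mathrm{Var}}^{MKCV} = \frac{1}{K}\left[\frac{1}{M-1}\sum_{m=1}^{M}\left(\mathrm{err}_{1m} - \frac{1}{M}\sum_{m=1}^{M}\mathrm{err}_{1m}\right)^{2}\right].$$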

4. Future Work

The present paper summarizes the results of our preliminary simulation experiments. More comprehensive simulations are underway in our group to compare our IF-based method with the ad hoc methods for variance estimation in cross validation.


Table 1: IF and ad hoc estimators under different configurations (classifier, dimensionality p, number of folds K) for sample sizes n1 = n2 = 10 and n1 = n2 = 20. Columns: true performance AUC^MKCV, mean estimate hat{AUC}^MKCV, its Monte Carlo standard deviation SD[hat{AUC}], mean IF-based estimate E[hat{SD}_IF], mean ad hoc estimate E[sqrt(hat{Var}^MKCV)] from (4), and SD[hat{SD}_IF].

n1 = n2 = 10

Classifier   p   K    AUC^MKCV   hat{AUC}^MKCV   SD[hat{AUC}]   E[hat{SD}_IF]   E[sqrt(hat{Var}^MKCV)]   SD[hat{SD}_IF]
LDA          2   10   .7679      .7718           .1263          .0967           .01303                   .0264
LDA          2   5    .7657      .7737           .1263          .0951           .0078                    .0265
LDA          2   2    .7628      .7437           .1300          .1029           .00437                   .0292
LDA          4   10   .7437      .7260           .1474          .1017           .01378                   .0246
LDA          4   5    .7427      .7278           .1379          .1000           .0084                    .0259
LDA          4   2    .7420      .6793           .1269          .1044           .00497                   .0237
QDA          2   10   .7226      .7339           .1472          .1016           .01366                   .0275
QDA          2   5    .7273      .7207           .1434          .1016           .0084                    .0263
QDA          2   2    .7255      .6757           .1312          .1045           .0049                    .0228
QDA          4   10   .6607      .6492           .1655          .1095           .01468                   .0222
QDA          4   5    .6601      .6362           .1504          .1056           .0090                    .0208
QDA          4   2    .6598      .5732           .0925          .0867           .0057                    .0148

n1 = n2 = 20

Classifier   p   K    AUC^MKCV   hat{AUC}^MKCV   SD[hat{AUC}]   E[hat{SD}_IF]   E[sqrt(hat{Var}^MKCV)]   SD[hat{SD}_IF]
LDA          2   10   .7825      .7945           .0772          .0682           .00783                   .0127
LDA          2   5    .7822      .7919           .0715          .0691           .0049                    .0115
LDA          2   2    .7822      .7831           .0812          .0715           .00258                   .0147
LDA          4   10   .7743      .7684           .0835          .0715           .00820                   .0124
LDA          4   5    .7758      .7676           .0886          .0716           .0051                    .0126
LDA          4   2    .7740      .7394           .0870          .0765           .0029                    .0131
QDA          2   10   .7616      .7705           .0869          .0715           .00816                   .0126
QDA          2   5    .7601      .7621           .0933          .0727           .0052                    .0127
QDA          2   2    .7608      .7387           .0918          .0771           .00292                   .0129
QDA          4   10   .7188      .7080           .1084          .0772           .00885                   .0114
QDA          4   5    .7205      .7013           .1021          .0781           .00557                   .0120
QDA          4   2    .7189      .6565           .0897          .0789           .00343                   .0095

Table 2: IF and ad hoc estimators under different configurations (classifier, dimensionality p, number of folds K), n1 = n2 = 60. Columns as in Table 1.

Classifier   p   K    AUC^MKCV   hat{AUC}^MKCV   SD[hat{AUC}]   E[hat{SD}_IF]   E[sqrt(hat{Var}^MKCV)]   SD[hat{SD}_IF]
LDA          4   10   .7956      .7941           .0433          .0401           .00405                   .0040
LDA          4   5    .7953      .7936           .0427          .0403           .00263                   .0040
LDA          4   2    .7954      .7853           .0444          .0417           .00137                   .0045
QDA          4   10   .7741      .7718           .0492          .0422           .00422                   .0042
QDA          4   5    .7743      .7651           .0511          .0432           .00275                   .0044
QDA          4   2    .7746      .7460           .0512          .0464           .00153                   .0046


References

Bengio, Yoshua and Yves Grandvalet. 2004. "No Unbiased Estimator of the Variance of K-Fold Cross-Validation." J. Mach. Learn. Res. 5:1089–1105.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.

Burman, Prabir. 1989. "A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods." Biometrika 76(3):503–514.

Duda, Richard O., Peter E. Hart and David G. Stork. 2001. Pattern Classification. 2nd ed. New York: Wiley.

Efron, Bradley and Robert Tibshirani. 1997. "Improvements on Cross-Validation: The .632+ Bootstrap Method." Journal of the American Statistical Association 92(438):548–560.

Hampel, Frank R. 1974. "The Influence Curve and Its Role in Robust Estimation." Journal of the American Statistical Association 69(346):383–393.

Hampel, Frank R. 1986. Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics. New York: Wiley.

Hastie, Trevor, Robert Tibshirani and J. H. Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. New York: Springer.

Huber, Peter J. 1996. Robust Statistical Procedures. 2nd ed. CBMS-NSF Regional Conference Series in Applied Mathematics 68. Philadelphia: Society for Industrial and Applied Mathematics.

Huber, Peter J. 2004. Robust Statistics. Wiley Series in Probability and Statistics. Hoboken, N.J.: Wiley-Interscience.

Isaksson, A., M. Wallman, H. Göransson and M. G. Gustafsson. 2008. "Cross-validation and bootstrapping are unreliable in small sample classification." Pattern Recognition Letters 29(14):1960–1965.

Markatou, Marianthi, Hong Tian, Shameek Biswas and George Hripcsak. 2005. "Analysis of Variance of Cross-Validation Estimators of the Generalization Error." J. Mach. Learn. Res. 6:1127–1168.

Yousef, Waleed A., Robert F. Wagner and Murray H. Loew. 2005. "Estimating the Uncertainty in the Estimated Mean Area Under the ROC Curve of a Classifier." Pattern Recognition Letters 26(16):2600–2610.
