The Problem of Cross-Validation: Averaging and Bias, Repetition and Significance

Adham Atyabi², David M W Powers¹,²
¹ Beijing Municipal Lab for Multimedia & Intelligent Software, Beijing University of Technology, Beijing, China
[email protected]
Abstract— Cross-Validation (CV) is the primary mechanism used in Machine Learning to control generalization error in the absence of sufficiently large quantities of marked up (tagged or labelled) data to undertake independent testing, training and validation (including early stopping, feature selection, parameter tuning, boosting and/or fusion). Repeated Cross-Validation (RCV) is used to try to further improve the accuracy of our performance estimates, including compensating for outliers. Typically a Machine Learning researcher will then compare a new target algorithm against a wide range of competing algorithms on a wide range of standard datasets. The combination of many training folds, many CV repetitions, many algorithms and parameterizations, and many training sets adds up to a very large number of data points to compare, and a massive multiple testing problem quadratic in the number of individual test combinations. Research in Machine Learning sometimes involves basic significance testing, or provides confidence intervals, but seldom addresses the multiple testing problem, whereby a significance threshold of p < 0.05 means that around 1 in 20 comparisons is expected to appear significant by chance.

I. INTRODUCTION

Even a trivial or chance-level classifier can achieve Recall or Bias above 50% while Accuracy = Precision = Prevalence > 0 [1]. This has led to several proposals for unbiased measures based on ROC [2], Informedness based on implicit costs of errors [2,3,4], Kappa [4,5,6], Correlation [7] or equivalences with Psychological measures such as DeltaP' [7,8], with most of these studies actually showing the equivalence of some of these measures under various conditions. However, all these measures can be seen to be equivalent for Bias = Prevalence. Under these conditions of equality, Recall = Precision and some of the discrepancies, errors and traps disappear.

One of the issues that arises with Cross-Validation is how to average across different folds and/or repetitions, and in particular different results may be obtained by different approaches based on Micro-Averaging or Macro-Averaging. For example, with F-measure, one can Macro-Average the individual F-measure results, or one can Macro-Average the individual Precision and Recall results and then take the harmonic mean to estimate F-measure, but neither of these approaches is correct [9]. Micro-Averaging to produce a new contingency table will allow correct assessment of all contingent statistics, but ROC and other curve analyses are more complex again, and merging of curves is an additional possibility but depends on understanding (calibrating) the relationships to thresholds [9].
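To make the averaging pitfall concrete, the following sketch (illustrative only; the per-fold counts are invented, not taken from the paper) contrasts three ways of combining F-measure across folds: macro-averaging the per-fold F-measures, taking the harmonic mean of macro-averaged Precision and Recall, and micro-averaging the pooled contingency table. In general the three disagree.

```python
import numpy as np

# Hypothetical per-fold confusion counts (TP, FP, FN) for the positive class.
# These numbers are invented purely to illustrate the averaging issue.
folds = [(8, 2, 1), (3, 1, 6), (5, 5, 5)]

def f1(tp, fp, fn):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# (a) Macro-average of the per-fold F-measures
macro_f = np.mean([f1(*f) for f in folds])

# (b) Harmonic mean of the macro-averaged Precision and Recall
precs = [tp / (tp + fp) for tp, fp, fn in folds]
recs  = [tp / (tp + fn) for tp, fp, fn in folds]
macro_pr_f = 2 * np.mean(precs) * np.mean(recs) / (np.mean(precs) + np.mean(recs))

# (c) Micro-average: pool the contingency tables, then compute F-measure once
TP, FP, FN = (sum(x) for x in zip(*folds))
micro_f = f1(TP, FP, FN)

print(f"macro-F={macro_f:.3f}  F(macro-P, macro-R)={macro_pr_f:.3f}  micro-F={micro_f:.3f}")
```

Running this prints three different values, which is exactly the point: only the pooled (micro-averaged) contingency table supports consistent computation of all the contingent statistics.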
Computing baseline performance for a chance-level classifier is one way of allowing more meaningful comparison of the performance of different algorithms and variations, but it does not help with comparing across datasets with different distributions. The above chance-corrected measures effectively subtract off the baseline chance performance and renormalize to a probability, producing a Kappa statistic, but the other measures achieve a similar and more robust effect in a more principled and less ad hoc way. Kappa was designed to compare people, not classifiers, and many different variants exist, with certain relationships understood between them [10]; Cohen Kappa is the one most commonly used as a Machine Learning (ML) evaluation measure, and the most directly comparable to ROC AUC, Correlation, Informedness and DeltaP' [2-8], assuming specified Bias and Prevalence [19].

A related issue is that many classifiers minimize error, which equivalently maximizes Accuracy. Are they appropriate for these other measures? For a given Bias:Prevalence relationship, Cohen Kappa subtracts off and renormalizes by division by fixed quantities, and so is maximized when Accuracy is maximized. Single-point ROC AUC = (Informedness+1)/2 = (DeltaP'+1)/2, and for Bias = Prevalence, single-point ROC AUC = (Kappa+1)/2 [19,20]. Where Bias:Prevalence differs from 1:1 it is equivalent to a different cost weighting (or skew) rather than assuming that correctly and incorrectly labelled Positives have the same cost. This specifies the slope of, and thus changes, the cost-neutral line to be different from the chance line in the ROC plot [2,7]. Weighted Relative Accuracy (WRAcc) is a skew-related constant times Informedness or DeltaP' (which are equivalent to a skew-insensitive form of WRAcc) [2,7]. WRAcc is weighted to counteract the effect of chance, but Accuracy is only guaranteed optimal if Bias = Prevalence or, equivalently, true and false positives have the same cost, and thus this is an essential constraint for systems that maximize Accuracy (this closes an open question as to whether Bias = Prevalence is a necessary and sufficient condition for optimal parameterization of a given learner – it is a necessary condition where positive and negative costs are equal and our learner optimizes in terms of Accuracy or Error; it is sufficient in that any system on a cost isometric is equally good [2]).

B. Macro-Averaging

Informedness generalizes to the multiclass case as a bias-weighted average of the individual cases – bias because it is the labels we predict that are under our control. Weighting (Macro-Averaging) over the prevalence of different classes is only valid for the corresponding Recalls (or equivalently the True Positive/Negative/Label Rates), and it then corresponds to Rand Accuracy. The Macro-Averages of ROC AUC, Precision, Kappa and F-Factor over class prevalence are statistically meaningless, although it is suggested [9] that the Macro-Average of ROC AUC is as good as we can do, due to the varying biases and uncalibrated thresholds. This is something of a problem for toolboxes like WEKA [11] or RapidMiner [12] that allow routine display of Macro-Averages of every statistic calculated.
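The chance-corrected measures discussed above can be computed directly from the contingency table. The sketch below does this for a 2x2 table, and then shows one reading of the bias-weighted multiclass Macro-Average of Informedness described in subsection B; the counts are invented, and the multiclass combination is an illustrative interpretation rather than necessarily the paper's exact definition.

```python
import numpy as np

def binary_measures(tp, fp, fn, tn):
    """Chance-corrected measures for a 2x2 contingency table."""
    n = tp + fp + fn + tn
    recall     = tp / (tp + fn)          # true positive rate (sensitivity)
    inv_recall = tn / (tn + fp)          # true negative rate (specificity)
    precision  = tp / (tp + fp)
    inv_prec   = tn / (tn + fn)          # negative predictive value
    accuracy   = (tp + tn) / n
    # Expected accuracy of a chance classifier with the same Bias and Prevalence
    exp_acc = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    informedness = recall + inv_recall - 1        # = DeltaP' = 2*AUC - 1 (single point)
    markedness   = precision + inv_prec - 1       # = DeltaP
    kappa        = (accuracy - exp_acc) / (1 - exp_acc)
    return informedness, markedness, kappa

def multiclass_informedness(conf):
    """One-vs-rest Informedness per predicted label, combined by a bias-weighted
    average (a sketch of one reading of subsection B; assumes every label is
    both predicted and present at least once)."""
    conf = np.asarray(conf, dtype=float)   # rows = actual class, cols = predicted label
    n = conf.sum()
    total = 0.0
    for k in range(conf.shape[1]):
        tp = conf[k, k]
        fp = conf[:, k].sum() - tp
        fn = conf[k, :].sum() - tp
        tn = n - tp - fp - fn
        bias_k = conf[:, k].sum() / n      # proportion of predictions given label k
        total += bias_k * (tp / (tp + fn) + tn / (tn + fp) - 1)
    return total

print(binary_measures(tp=40, fp=10, fn=20, tn=30))
print(multiclass_informedness([[30, 5, 5], [4, 40, 6], [6, 4, 50]]))  # invented 3-class counts
```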
C. Cross-Validation

The problem of averaging not only affects statistics collected for individual class labels, but also the combination of statistics for different cross-validation folds or repetitions, and this is in fact the context of some of the previous discussion [9]. If different folds have different prevalences (in the absence of stratification) and different bias points (based on the setting of thresholds and other parameters), we are comparing apples with oranges and averaging of the traditional statistics is meaningless – Precision, Recall and Accuracy can only be meaningfully compared under conditions of constant Prevalence and Bias. Informedness and Kappa remove this bias, and whilst Kappa is based on Accuracy and has a mixed dependence on Bias and Prevalence, Informedness or DeltaP' (and as a consequence ROC AUC) are well defined probabilities when Macro-Averaged over predicted labels, and its dual Markedness or DeltaP can be Macro-Averaged over actual classes [7].

There are many different learning algorithms, and some of these have natural threshold-like parameters or implicit probabilistic outputs. For example, a pruned decision tree will in general have a mix of Positive and Negative instances in each node, and a KNN learner will have a similar mix in each neighbourhood. These can be given a direct probabilistic interpretation, and a varying probability threshold can thus give rise to a ROC analysis. Similarly, Neural Networks (including Single- and Multi-Layer Perceptrons, Extreme Learning Machines, and Linear and non-Linear Support Vector Machines) generally have a transfer function that is trained in a way that will tend to make Positive cases positive and Negative cases negative, and allow a hard-limit transfer function with a specified bias – this bias is a parameter that will directly control the Bias of the predictor and generate a ROC analysis. Moreover, by integrating the numbers of examples declared positive at each threshold value we not only get an AUC measure, we have the ability to normalize the varied parameter to directly give us a probability – so that bias = Bias.

D. Optimizing for Orthogonality

The previous discussion has implicitly assumed we are trying to optimize some measure of accuracy or informedness, but training of multiple models creates new opportunities and new pitfalls, whether trained and tested on multiple folds, multiple subsets of the features, or multiple algorithms or parameterizations. Rather than aiming to be close to perfect and different from chance, and measuring this, we can also seek a Pareto tradeoff between working well and being different. The methods we eliminate as suboptimal should be considered to see whether they are strongly correlated or contain additional information that may be fused to create a better classifier than any of the separate ones.

E. Distributions and Significance

For any experimental research, including Machine Learning, it is important to understand whether our apparent results are real effects or due to chance. This question is complicated by the different measures we may use to assess our results, the fact that we may be varying multiple parameters (including the algorithm or the features used), and the fact that data used to train the model is then not available for independent testing (so that techniques like Bootstrapping or Cross-Validation involve us actually testing multiple variants of what is meant to be a single model).
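Before turning to the distributional details, a small sketch makes concrete the point from subsection C that sweeping a threshold (bias) over a classifier's continuous outputs traces out a ROC analysis, and that at any single operating point single-point AUC = (Informedness+1)/2. The scores and labels are simulated; this is illustrative code, not the paper's implementation.

```python
import numpy as np

def roc_from_scores(scores, labels):
    """Sweep a decision threshold over continuous classifier outputs to trace
    the ROC curve.  Returns (false positive rates, true positive rates, AUC)."""
    order = np.argsort(-scores)               # decreasing score = increasing bias threshold
    labels = np.asarray(labels)[order]
    P, N = labels.sum(), (1 - labels).sum()
    tpr = np.concatenate(([0.0], np.cumsum(labels) / P))
    fpr = np.concatenate(([0.0], np.cumsum(1 - labels) / N))
    auc = np.trapz(tpr, fpr)                  # integrate TPR over FPR (Fallout), not threshold
    return fpr, tpr, auc

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=200)                  # hypothetical true classes
s = rng.normal(loc=y.astype(float), scale=1.0)    # hypothetical scores, higher for positives
fpr, tpr, auc = roc_from_scores(s, y)

# At a single operating point, Informedness = TPR - FPR, so
# single-point AUC = (Informedness + 1) / 2 as stated in the text.
i = len(fpr) // 2
print(f"AUC = {auc:.3f}, single-point AUC at mid threshold = {(tpr[i] - fpr[i] + 1) / 2:.3f}")
```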
Significance testing is not about how good a model is, but how different it is – classically how different from a baseline chance or null model, but potentially whether it is beyond the range of chance variation of an alternate model according to the Normal distribution (if assumed to result from a multitude of variables), the Student distribution (if the mean is estimated from a sample – see Table I), or any other model appropriate to small N or few degrees of freedom (DoF).

The simple uncorrected single-class distributions (Precision, Recall) reflect a coin-toss probability model of decision making and error, reflected in the Bernoulli distribution, with mean p and variance p(1-p). The averages of these (Rand Accuracy, F-measure) reflect a Poisson Binomial distribution, as the individual distributions can in general differ and the distinct Bernoulli variances need to be averaged accordingly. Pointwise ROC AUC = Informedness/2 + 1/2, and general ROC AUC is thus an integral of a linear function of Informedness, each of which has, in general, a different value pθ at each different threshold θ and so is Poisson distributed. However, one feature of ROC AUC is that it is integrated not over threshold [4] but over false positive rate (Fallout). Moreover it is inconvenient to integrate unless a Poisson or similar assumption is made for Informedness and thus variance = expectation = pN for both the single-point and general versions of ROC AUC.

The chance-corrected measures are thus assumed to obey a Poisson distribution, with the well defined probabilities (Informedness, Markedness, and the Coefficient of Determination = squared Matthews Correlation – being the joint probability of Informedness and Markedness [7,13]) determining the arrival of informed knowledge versus a guess, which itself has a random Bernoulli distribution; for expected Informedness or Guessing rate p we also model both expectation and variance as pN. The Poisson assumption is normally regarded as a reasonable approximation for large N (100 or more) and rare occurrences, when we may assume expected count and variance both to be modelled adequately by pN. The 'rare' occurrences are the individual cases which, given lack of duplicates, occur with probability around 1/N and are then diagnosed in an Informed or Guessing fashion with probability p. With N such events we have our expectation pN, and this applies whether we are counting in terms of the hidden variable 'Informed decisions' or the hidden variable 'Guessing decisions', which are related by p ↔ 1-p, with pN and (1-p)N determining the variance of distributions of either rare Informed decisions or rare Guesses where p is close to 0 or 1. It is evidently not such a good assumption for measures that are not chance corrected, with chance expectations in the midrange (e.g. ½ for equal prevalence and/or bias).
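As a rough numerical illustration of the Poisson assumption just described, and only under that assumption (variance = expectation = pN), an Informedness of B estimated from N test instances corresponds to an expected count of BN informed decisions with variance approximately BN, giving an approximate z-score of BN/√(BN) = √(BN) against the chance (B = 0) null. The values of B and N below are invented.

```python
from math import sqrt
from scipy.stats import norm

def informedness_z(b, n):
    """Approximate z-score for Informedness b estimated from n instances,
    under the Poisson assumption variance = expectation = b*n (sketch only)."""
    expected = b * n                 # expected number of informed decisions
    return expected / sqrt(expected) # = sqrt(b*n)

b, n = 0.25, 200                     # hypothetical Informedness and test-set size
z = informedness_z(b, n)
p = norm.sf(z)                       # one-sided p-value against the chance null
print(f"z = {z:.2f}, one-sided p = {p:.4g}")
```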
In CxK-CV we have multiple folds with associated models (K) and multiple repetitions (C) testing across the full data with different pseudorandom allocations of instances to training and test folds, and thus in general a different set of models being evaluated. At the level of models, we thus have CxK models and associated test folds that represent the performance of a particular algorithm on a particular dataset. We then need to look at the difference between algorithms across each of these models and folds. When the dataset and partitionings, and the numbers of folds and repetitions, are the same, we can compare the differences in a pairwise fashion.

For the first K-CV, the standard error in the difference of means of the individual distributions can be calculated straightforwardly from its mean μd and variance σd² as εd = σd/√K, and the t-test compares the statistic td = μd/εd against Student's distribution with K-1 degrees of freedom. For C repetitions, viz. CxK-CV, the standard error is estimated by the corrected resampled t-test as εd = σd/√H, where H is the harmonic sum of K and C, H = 1/[1/C + 1/K], and the t-test compares the statistic td = μd/εd against Student's distribution with CK-1 degrees of freedom [14,15,11:157-159].

For improved repeatability of results and consistency across folds, common seeding and stratification are recommended, so that all algorithms train using the same pseudorandom allocations to test and training folds, and all folds reflect the overall class prevalences. This introduces pairing across both folds and repetitions and makes the test unnecessarily conservative. We restore the power by redefining H as the harmonic mean of K and C, H = 2/[1/C + 1/K]. We do not address the case where results from independent experiments are compared with different values for C, K and N.
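A minimal sketch of the paired test just described, implementing the formulas exactly as stated in the text (harmonic sum of C and K for the corrected resampled test, harmonic mean for the seeded/stratified paired design). The per-fold differences are simulated and purely hypothetical.

```python
import numpy as np
from scipy import stats

def cv_difference_test(diffs, C, K, paired_design=False):
    """t-test on the mean per-fold difference between two algorithms over CxK-CV,
    using the effective sample size H described in the text: harmonic sum
    H = 1/(1/C + 1/K) for the corrected resampled test, or harmonic mean
    H = 2/(1/C + 1/K) when folds and repetitions are paired by common seeding
    and stratification."""
    diffs = np.asarray(diffs, dtype=float)   # one difference per model/fold, C*K values
    mu_d = diffs.mean()
    sd_d = diffs.std(ddof=1)
    H = (2.0 if paired_design else 1.0) / (1.0 / C + 1.0 / K)
    se_d = sd_d / np.sqrt(H)                 # standard error of the mean difference
    t_d = mu_d / se_d
    dof = C * K - 1
    p = 2 * stats.t.sf(abs(t_d), dof)        # two-sided p-value
    return t_d, p

# Hypothetical example: 10x10-CV with small simulated per-fold differences
rng = np.random.default_rng(0)
d = rng.normal(loc=0.01, scale=0.02, size=100)
print(cv_difference_test(d, C=10, K=10, paired_design=True))
```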
II. MULTIPLE TESTING

So far we have considered a single variable that represents a dichotomous variant versus an alternate hypothesis. This may be a baseline or null hypothesis, or it may be the putative best current solution to the problem. If we are seeking a better alternative, we will design a priori with a one-sided distribution. If we want to have an each-way bet, or are agnostic about which is expected to be better, we may use an ad hoc two-sided distribution. But what if we have many algorithms, or a whole family of parametric variants of a single algorithm? This is a major area of research for which there are no entirely satisfactory solutions. Tukey HSD, Tukey-Kramer, False Discovery Rate, Family-Wise Error and Scheffé are some of the favourites, in order of generality. Given our seeded stratification and pairwise matching assumptions, and specific pairs of algorithms and parameterizations being tested, we will assume Tukey HSD, which is a straightforward modification in which σd is replaced by the standard deviation of the entire design σe and the largest differences are tested first for efficiency. If multiple testing is ignored, the apparent significances in pairwise testing are likely to be invalid, as the significance level specifies the probability, and thus determines the expected number, of tests that appear significant by chance. However, an ANOVA across the full dataset gives us an indication of overall significance.

TABLE II. ANOVAS FOR FIBONACCI PARADIGM SHOWING LACK OF SIGNIFICANCE OF CHOICE OF CROSS-VALIDATION PARADIGM
Expt  Source                     SumSq.   dof.  MeanSq.  F         Prob>F
2a    Algorithms                  1.5343   49   0.03131    110.16  0.0000
2a    CV paradigms                0.0004    2   0.00022      0.78  0.4577
2a    Subjects                   13.1952    4   3.29879  11605.36  0.0000
2a    Algorithms * CV paradigms   0.0353   98   0.00036      1.27  0.0618
2a    Algorithms * Subjects       1.1316  196   0.00577     20.31  0.0000
2a    CV paradigms * Subjects     0.0018    8   0.00022      0.79  0.6106
2b    Algorithms                  7.3916   49   0.15085   1114.61  0.0000
2b    CV paradigms                0        2   0.00001      0.04  0.9573
2b    Subjects                    6.6591    4   1.66478  12300.97  0.0000
2b    Algorithms * CV paradigms   0.0208   98   0.00021      1.57  0.0014
2b    Algorithms * Subjects       3.0098  196   0.01536    113.46  0.0000
2b    CV paradigms * Subjects     0.002     8   0.00025      1.87  0.0629
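To illustrate the scale of the multiple testing problem discussed in Section II (this is a simple Bonferroni-style simulation with invented, chance-level data, not the paper's Tukey HSD procedure): with A algorithms there are A(A-1)/2 pairwise comparisons, and at α = 0.05 roughly 5% of them will appear significant even when no real differences exist.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
A, folds = 50, 100                       # number of algorithms and of CV folds (simulated)
# Null scenario: every "algorithm" is pure chance, drawn from the same distribution
scores = rng.normal(0.5, 0.05, size=(A, folds))

pairs = list(combinations(range(A), 2))            # quadratic: A*(A-1)/2 comparisons
pvals = np.array([ttest_rel(scores[i], scores[j]).pvalue for i, j in pairs])

alpha = 0.05
naive = (pvals < alpha).sum()                      # 'significant' with no correction
bonferroni = (pvals < alpha / len(pairs)).sum()    # simple family-wise correction
print(f"{len(pairs)} comparisons: {naive} naively significant "
      f"(about {alpha * len(pairs):.0f} expected by chance), {bonferroni} after Bonferroni")
```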
III. MINIMAL TESTING
With multiprocessors and supercomputers, the temptation is to increase the number of variations we explore, running multiple large CxK-CVs in parallel. The parallelization means that they can finish before we have really thought about what to do next, and so we try more and more half-baked ideas. The many collections of results obtained then create a massive problem in terms of the sheer size of the analysis, as well as a specific problem in Multiple Testing. Ensemble, Boosting, Feature Selection, Evolutionary, Genetic, Swarm and Colony algorithms provide new paradigms for exploring the feature and parameter space, but are not immune to the Multiple Testing problem, which due to the basic assumption of all-pairs comparison grows quadratically with the number of algorithms (including feature selections and parameterizations).

The first observation is that, due to the harmonic sum of C and K, C ~ K gives an approximate upper bound for the effective range of C, as H → K as C → ∞. By the time DoF = CK-1 is in the range 60 to 100, the Student distribution is within a few percent of normal (Table I), so any gain in power must derive from a large N, must relate to variation in the training sets, which have an overlap of C-1/K and even more across repetitions, and will approach a limit in C. Little value is expected beyond 10x10CV, but 5x2CV and 2x5CV are evidently much too small to give good repeatability [15]. In our empirical study we use 10x20CV as an upper bound for our explorations, but 8x8CV is not expected to be significantly worse based on these considerations. We also note in Table I that we conventionally use alphas as limits on p-values that go up in multiples of 2 or 2.5, whilst we use a Fibonacci sequence (approaching an exponent of φ) to explore successive folds (up to 20) or repetitions (above 20).

We have undertaken experiments in which we compare the full 10x20CV against significance stopping after a Tripling sequence of repetitions {1, 3 and 9} (not counting the initial 20CV), as well as significance stopping after the above Fibonacci sequence of folds (until we truncate at 20, with DoF = 19) and then continuing with a Fibonacci sequence of repetitions (of 20 folds each); a sketch of this checkpointing appears below. The dataset used is left/right "thought command" classification from the Brain Computer Interface Competition dataset BCI IVa [16], and the parameter/feature space explored follows the manipulations of the BCI IVa epochs described in [17] but explores different methods for weighting subwindows [18]. The algorithms explored were SVM, ELM and Linear Perceptron with early stopping, seeking to minimize error or maximize accuracy. SVD and FFT were explored as preprocessing steps. All implementation was in Matlab, and runs were performed both on a 260-core Xeon-based multiprocessor and on a network of SPARC and x86 processors. All results are examined and displayed in terms of Informedness so as to be comparable across different multiclass datasets and subproblems.
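The following sketch shows the general shape of the checkpointed early stopping: repetitions are only tested for significance at Tripling {1, 3, 9} or Fibonacci checkpoints, and further work is aborted once the outcome is already clear. The `run_rep` callback, the stop-on-significance rule and the simulated differences are illustrative placeholders, not the paper's actual criteria or code; a full implementation would also use the corrected standard error from the previous section rather than a plain t-test.

```python
import numpy as np
from scipy.stats import ttest_1samp

def checkpointed_rcv(run_rep, checkpoints=(1, 3, 9), alpha=0.01, max_reps=10):
    """Repeated CV with significance-based early stopping at checkpoints.
    run_rep(r) is a placeholder callback returning the per-fold differences
    (algorithm A minus algorithm B) for repetition r.  The Tripling schedule
    {1,3,9} is the default; a Fibonacci-style schedule would use checkpoints
    such as (1, 2, 3, 5, 8, 13, 20) over folds and then repetitions."""
    diffs = []
    for r in range(1, max_reps + 1):
        diffs.extend(run_rep(r))
        if r in checkpoints:
            res = ttest_1samp(diffs, 0.0)      # paired test: mean difference vs 0
            if res.pvalue < alpha:             # outcome already clear: abort further reps
                return r, res.statistic, res.pvalue
    return max_reps, res.statistic, res.pvalue

# Hypothetical usage with simulated per-fold differences (K = 20 folds per repetition)
rng = np.random.default_rng(2)
reps_used, t, p = checkpointed_rcv(lambda r: rng.normal(0.02, 0.05, size=20))
print(f"stopped after {reps_used} repetition(s): t = {t:.2f}, p = {p:.4f}")
```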
IV. RESULTS

Example ANOVAs are presented in Table II for subwindowing experiments 2a and 2b from [18] under the Standard, Tripling and Fibonacci CxK-CV paradigms. All experiments were carried out in Matlab and show no significant difference due to the abbreviated Cross-Validation paradigm used. Significance was evaluated, and CV folds or repetitions pruned, using the two variant methods described in Section III, based on variants of the Student t-test. The results from the unpruned, 1/3/9-repetition-pruned and Fibonacci-fold+repetition-pruned variants were assessed for difference using an ANOVA followed by Tukey HSD. Differences amongst the per-subject/per-algorithm results between the full and pruned variants were not significant, with p>0.4 and p>>0.8 for Tripling and Fibonacci respectively.

In Experiment 2a, the Tripling method saved 45.1% of training+testing runs, aborting 58.8% of CVs, by early stopping just the CV repetitions, while the Fibonacci fold+repetition pruning method saved 75.7% of runs and cut short 81.6% of KCVs, with 46.4% being aborted within the initial KCV. In Experiment 2b, the Tripling method saved 50.7% of runs, aborting 66.4% of KCVs, by early stopping just the CV repetitions, but Fibonacci saved 81.3% of runs and cut short 86.8% of KCVs, with 50.8% being aborted within the initial KCV. The total number of possible runs in each experiment was 50000 (10x20CV x 5 subjects x 50 algorithms).

CONCLUSION

Whilst it is not possible in a few pages to give a complete tutorial covering the wide range of basic and advanced statistics discussed here, there are several take-home messages from this short paper. It is hoped that these will lead Machine Learning researchers to explore these matters further, rather than just to apply the recommendations as black-box approaches. The paper addresses three main issues:

1. Many experiments are being run without clear hypotheses and without awareness of the multiple hypothesis testing issues;
2. Many experiments are being run with accuracy statistics that are not chance corrected, when they should be using Informedness, Correlation or Kappa-like measures; and
3. Many experiments are being run with repeated cross-validation paradigms that are overkill, rather than using significance to guide where effort is likely to be of value, as per the Tripling or Fibonacci tests.

This paper provides guidelines as to how to improve both the efficiency and efficacy of multifactor Machine Learning explorations. The Tripling approach saves around 50% of training/testing runs by significance testing after repetitions 1/3/9 (excluding the initial 20-CV). The Fibonacci approach saves over 75% of runs by testing at Fibonacci intervals for both folds and repetitions. Our discussion of the Student distribution across the range of possible CxK-CV paradigms (Table I) and the 50% reduction for both folds and repetitions (Table II) suggest that 10x20CV is about four times too much. On the other hand, 2xKCV is evidently not enough, or the significance testing wouldn't have kept going past 2x20CV so often. 5x20CV is the average amount of work based on the Tripling reduction of repetitions, but setting the maximum to 10x10CV is evidently not enough (results not shown). With the Fibonacci approach, we even reduced the
number of folds within the first 20CV pass by around 50%. Of course the main advantage of a bigger K-CV is more data for learning, but at the cost of additional variance due to the extra models tested. In terms of the Fibonacci sequence, 14x14CV represents balanced last stopping points for a similar-power t-test and would seem to be worth exploring by Fibonacci pruning (future work).

REFERENCES

[1] Jim Entwisle and David M. W. Powers (1998). "The Present Use of Statistics in the Evaluation of NLP Parsers", pp. 215-224, NeMLaP3/CoNLL98 Joint Conference, Sydney, January 1998.
[2] Peter A. Flach (2003). The geometry of ROC space: Understanding Machine Learning metrics through ROC isometrics. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003, pp. 226-233.
[3] D. M. W. Powers (2003). Recall and Precision versus the Bookmaker. Proceedings of the International Conference on Cognitive Science (ICSC-2003), Sydney, Australia, 2003, pp. 529-534. (Accessed 28 Nov 2012 from http://david.wardpowers.info/BM/index.htm.)
[4] Arie Ben-David (2008a). About the relationship between ROC curves and Cohen's kappa. Engineering Applications of AI, 21:874–882, 2008.
[5] B. Di Eugenio and M. Glass (2004). The kappa statistic: a second look. Computational Linguistics 30:1, 95-101.
[6] Arie Ben-David (2008b). Comparison of classification accuracy using Cohen's Weighted Kappa. Expert Systems with Applications 34 (2008), 825–832.
[7] David M. W. Powers (2008). Evaluation Evaluation. The 18th European Conference on Artificial Intelligence (ECAI'08).
[8] Pierre Perruchet and R. Peereman (2004). The exploitation of distributional information in syllable processing. J. Neurolinguistics 17:97−119.
[9] George Forman and Martin Scholz (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. SIGKDD Explorations 12:1, 49-57.
[10] M. J. Warrens (2010). Inequalities between multi-rater kappas. Advances in Data Analysis and Classification 4:271-286.
[11] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1).
[12] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler (2006). YALE: Rapid prototyping for complex data mining tasks. KDD 2006:935–940.
[13] D. M. W. Powers (2011). Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2:1, 37-63.
[14] C. Nadeau and Y. Bengio (2003). Inference for the generalization error. Machine Learning 52:239–281.
[15] Remco R. Bouckaert and Eibe Frank (2004). Evaluating the replicability of significance tests for comparing learning algorithms. Lecture Notes in Computer Science 3056, 3-12, Springer.
[16] B. Blankertz, K.-R. Müller, D. J. Krusienski, G. Schalk, J. R. Wolpaw, A. Schlögl, G. Pfurtscheller, J. del R. Millán, M. Schröder, N. Birbaumer (2006). The BCI competition III: Validating alternative approaches to actual BCI problems. IEEE Trans. Neural Syst. Rehabil. Eng., 14:2, 153–159.
[17] A. Atyabi, S. P. Fitzgibbon, and D. M. W. Powers (2011). Multiplying the Mileage of Your Dataset with Subwindowing. Brain Informatics, Springer Lecture Notes in Computer Science, 6889, 173-184.
[18] A. Atyabi and D. M. W. Powers (2012). The impact of segmentation and replication on non-overlapping windows: an EEG study. Spring World Congress on Engineering and Technology, SCET2012, in press.
[19] D. M. W. Powers (2012). The problem with Kappa. Conference of the European Chapter of the Association for Computational Linguistics, EACL2012, in press.
[20] D. M. W. Powers (2012). The problem of Area Under the Curve. International Conference on Information Science and Technology, ICIST2012, in press.