Article
Perspectives on the Use of Null Hypothesis Statistical Testing. Part II: Is Null Hypothesis Statistical Testing an Irregular Bulk of Masonry?
Educational and Psychological Measurement
2017, Vol. 77(4) 613–615
© The Author(s) 2016
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164416667987
journals.sagepub.com/home/epm
Fernando Marmolejo-Ramos¹ and Denis Cousineau²

Keywords
methodology, null hypothesis testing

The Silver Samarsanda stood above the Jardeen, behind a line of tall pencil cypress: an irregular bulk of masonry, plastered and whitewashed, with a wide, many-slanted roof of mossy tiles. Beside the entrance five colored lanterns hung in a vertical line: deep green, a dark, smoky scarlet, a gay light green, violet, and once more dark scarlet; and at the bottom, slightly to the side, a small, steady yellow lamp, the purport of all being: Never neglect the wonder of conscious existence, which too soon comes to an end!
—Jack Vance, The Brave Free Men, 1970

Gotham City. Clean shafts of concrete and snowy rooftops. The work of men who died generations ago. From here, it looks like an achievement. From here, you can’t see the enemy.
—Frank Miller, The Dark Knight Returns, 1986
¹ Stockholm University, Stockholm, Sweden
² University of Ottawa, Ottawa, Ontario, Canada

Corresponding Author: Fernando Marmolejo-Ramos, Gösta Ekman Laboratory, Department of Psychology, Stockholm University, Frescati Hagväg 9A, Stockholm 11419, Sweden.
Email: [email protected]

This second part of the special issue on “Perspectives on the Use of Null Hypothesis Statistical Testing (NHST)” is dedicated to the classic null hypothesis statistical testing framework. This framework slowly evolved from the work of Bernoulli (1713/2005) and Gauss (1809/1864), among others. Fisher (1925, 1935/1951) formalized the idea of hypothesis testing, building on the core idea of a null hypothesis and the evaluation of the probability of the data under this hypothesis. At this stage, there were few sources of confusion: A model of the null hypothesis (the null model) is set down, often based on the famous normality assumption. In parallel, a model of the observed data is laid down. The two models are compared using a likelihood ratio. The model of the observed data always fits at least as well as the null model, but what matters is the loss in fit of the null model. An F statistic (or the square of a t statistic) is obtained by computing twice the decrement in fit when fits are expressed using log likelihoods. The importance of the decrement is then assessed by finding the probability of a decrement at least as large as the one observed, given that there is no difference between the two models (see Cousineau & Allan, 2015, for an overview). Thus, in Fisher’s view, a small p value indicates that it is not desirable to describe the data using the null model.

In our opinion, things got confusing after the work of Neyman and Pearson (1933). These two researchers wanted to identify optimal statistical tests. The Neyman–Pearson lemma established that likelihood ratio tests are the most powerful tests for a given significance level α. To that end, they needed to set the true population difference (the true effect size). They called this true effect size the alternative hypothesis, a formulation that Fisher strongly opposed (Cohen, 1990, p. 1307). This wording is very unfortunate because the true population effect size is not a hypothesis and, in any event, likelihood ratio tests are not meant to evaluate dual hypotheses. In subsequent textbooks, the alternative hypothesis crept in, but with a modified meaning: The alternative hypothesis became anything but the null hypothesis (e.g., Snedecor, 1946). The alternative hypothesis is not part of NHST; its inclusion leads to paradoxical situations (e.g., Wagenmakers, 2007, found that, for a given data set, NHST leads to a rejection decision whereas Bayesian computations lead to a nonrejection decision).
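The likelihood ratio logic described above can be made concrete with a small numerical sketch. It is only an illustration under stated assumptions, not the procedure of any cited author: the data are taken to be normal with unknown variance, the null model fixes the mean at a hypothetical value `mu0`, the sample values are invented, and all function names are ours. In this particular model, twice the decrement in log likelihood, G, maps one-to-one onto the squared t statistic through t² = (n − 1)(exp(G/n) − 1), and the p value is approximated by simulating data sets from the null model.

```python
import math
import random
import statistics


def max_loglik_normal(data, mean):
    """Normal log likelihood with the mean fixed and the variance set to its ML estimate."""
    n = len(data)
    var_ml = sum((x - mean) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2 * math.pi * var_ml) + 1)


def fit_decrement(data, mu0):
    """Twice the loss in log likelihood when the mean is forced to mu0 (the null model)."""
    ll_data = max_loglik_normal(data, statistics.fmean(data))  # model of the observed data
    ll_null = max_loglik_normal(data, mu0)                     # null model
    return 2 * (ll_data - ll_null)


def t_squared(data, mu0):
    """Square of the usual one-sample t statistic."""
    n = len(data)
    t = (statistics.fmean(data) - mu0) / (statistics.stdev(data) / math.sqrt(n))
    return t * t


def simulated_p_value(data, mu0, n_sim=2000, seed=1):
    """Share of null-model data sets whose decrement is at least as large as the observed one.

    The decrement does not depend on the scale of the data, so the simulated
    samples use a standard deviation of 1 without loss of generality.
    """
    rng = random.Random(seed)
    n = len(data)
    observed = fit_decrement(data, mu0)
    hits = 0
    for _ in range(n_sim):
        fake = [rng.gauss(mu0, 1.0) for _ in range(n)]
        if fit_decrement(fake, mu0) >= observed:
            hits += 1
    return hits / n_sim


# Hypothetical sample and null value, chosen only for illustration.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7]
mu0 = 5.0

g = fit_decrement(sample, mu0)
n = len(sample)
print("twice the decrement in fit:", g)
print("t squared recovered from it:", (n - 1) * (math.exp(g / n) - 1))
print("t squared computed directly:", t_squared(sample, mu0))
print("simulated p value:", simulated_p_value(sample, mu0))
```

Note that only the null model is needed to obtain the p value; no alternative hypothesis or true effect size enters the computation.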
This is one example of how a logical construction can be altered over the years to the point that it loses its foundations.

In the second part of this special issue, three articles defend the soundness of NHST. Häggström presents arguments for embracing such an approach and indicates ways to use it appropriately. García-Pérez argues that current alternatives to null hypothesis statistical tests do not solve the issues this type of testing is said to have. Miller reinforces the idea that null hypothesis testing is an appropriate method for doing science.

We also begin to examine alternatives to NHST (more will follow in a subsequent issue). One alternative to NHST that has received much attention is inference based on confidence intervals. Wiens and Nilsson illustrate this approach with contrast analyses in factorial designs, where confidence intervals are rarely given. Another growing field of research in statistics is how to reconcile parameter estimation and inference with the presence of outliers and non-Gaussian distributions. Wilcox and Serang demonstrate how robust statistics can address problems that null hypothesis testing, confidence intervals, and Bayesian approaches have.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
Bernoulli, J. (2005). The Art of Conjecturing, together with Letter to a Friend on Sets in Court Tennis (E. Sylla, Trans.). Baltimore, MD: Johns Hopkins University Press. (Original work published 1713)
Cohen, J. (1990). Things I have learned (so far). The American Psychologist, 45, 1304–1312. doi:10.1037/0003-066X.45.12.1304
Cousineau, D., & Allan, T. A. (2015). Likelihood and its use in parameter estimation and model comparison. Mesures et Évaluation en Éducation, 37(3), 63–98.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, Scotland: Oliver & Boyd.
Fisher, R. A. (1951). The design of experiments. New York, NY: Hafner. (Original work published 1935)
Gauss, C. F. (1864). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum [Theory of the motion of the heavenly bodies moving about the sun in conic sections]. Paris: A. Bertrand. (Original work published 1809)
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 231, 289–337. doi:10.1098/rsta.1933.0009
Snedecor, G. W. (1946). Statistical methods: Applied to experiments in agriculture and biology (4th ed.). Ames, IA: Collegiate Press.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804. doi:10.3758/BF03194105