Int J Biol Med Res. 2018; 9(2): 6278–6281 | Volume 9, Issue 2, April 2018 | www.biomedscidirect.com
Original Article

Determining the significance of the research results beyond p value

Dr Dinesh Kumar Bagga
Professor & Head, Dept of Orthodontics & Dentofacial Orthopaedics, School of Dental Sciences, Sharda University

*Corresponding Author: Dr Dinesh Kumar Bagga, Professor & Head, Dept of Orthodontics & Dentofacial Orthopaedics, School of Dental Sciences, Sharda University, Plot 32, 34, Knowledge Park III, Greater Noida, UP (India), PIN 201306. Ph: +91 9868071583
ABSTRACT

The p value or statistical significance is of limited value from a clinical perspective because a research finding with statistical significance may have little clinical significance whereas a research finding with no statistical significance may be of clinical significance. Therefore, statistical significance testing alone limits the understanding and applicability of research findings in clinical decision-making, due to its inability to provide all information about clinical significance. While most researchers focus on statistical significance, clinicians and clinical researchers need to focus on clinical significance beyond statistical significance. Evidence-based practice is supposed to impact clinical decision-making in clinical practice based on proper interpretation of clinical research and implementation of clinically significant research outcomes rather than statistically significant research outcomes. A published journal article with a statistically significant result is not necessarily important, as a statistically significant result does not mean that the difference between groups is big enough to be clinically significant or meaningful. Rather, it simply means that the difference is real (not explainable by sampling variability) rather than attributable to chance. Therefore, it is more important to understand what the p value is not than what it is. By understanding the limitations posed in clinical research by the heterogeneity of patient samples, small sample sizes, and the limitations of hypothesis testing itself, clinicians must include other clinically relevant measures such as effect size, clinically meaningful differences, confidence intervals, and magnitude-based inferences. Clinical researchers need to present research results with clinical significance beyond statistical significance, and clinicians need to interpret and implement those clinically significant outcomes in their evidence-based approach to clinical decision-making.

Keywords: p value, clinical significance, statistical significance, effect size, statistical power
© Copyright 2018 BioMedSciDirect Publications. IJBMR - ISSN: 0976-6685. All rights reserved.
Manuscript:

Null hypothesis significance testing (NHST) remains the cornerstone of clinical research. The p value is routinely used as a measure of statistical evidence against the null hypothesis in order to determine the significance of research results. The p values in hypothesis testing are used to classify the data as 'significant' or 'insignificant' depending upon whether the test 'rejects' or 'fails to reject' the null hypothesis.1 A level of significance (alpha level) is set between 0 and 1 as an arbitrary cut-off value to determine statistical significance. Researchers routinely choose an alpha level of 0.05 for testing their hypotheses. However, there are some situations where a lower alpha level (e.g. 0.01) or a higher alpha level (e.g. 0.10) may be desirable. At an α level of 0.05, p ≤ 0.05 rejects the null hypothesis, declaring the difference between groups 'statistically significant' and indicating that the difference is real, not due to sampling variability. On the other hand, p > 0.05 fails to reject the null hypothesis, declaring the difference between groups 'statistically insignificant' and indicating that the difference is attributable to chance due to sampling variability.2

The α level is the probability of making a Type I (false alarm) or alpha error by incorrectly rejecting the null hypothesis when the null hypothesis is actually true. The smaller the value of alpha, the less likely it is that we reject a true null hypothesis. If we choose a stringent alpha level of 0.01, rejecting the null hypothesis becomes very difficult, and it is less likely that the null hypothesis is rejected when it is actually true. This would minimize Type I error, but such a tiny rejection region may make rejecting the null hypothesis so difficult that we may fail to reject it even when it is false, leading to a Type II or beta error. In other words, the more we try to minimize a Type I error, the greater the likelihood of a Type II error creeping in. Scientists have found an alpha level of 0.05 to be a good balance between these two issues. There are some instances in which we would need a larger value of alpha (an α level of 0.10 or even greater) to reject a null hypothesis, where it is more acceptable to have a Type I error.3 In medical screening, a false positive test for a disease (Type I error) will cause the patient anxiety and require further confirmatory tests, which will eventually show the patient to be negative for the disease. However, a false negative test for a disease (Type II error) will categorize the patient as free of disease when in fact the disease is present, leaving the patient untreated and at risk of developing complications. Given the choice, we would rather accept a false positive (Type I error) than a false negative (Type II error) in this situation.
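The trade-off between the two error types can be seen in a small simulation. The sketch below is a hypothetical illustration rather than anything from the paper: it repeatedly draws two samples, once under a true null (no difference) and once under a real difference, and counts how often a two-sample t test rejects at α = 0.05 versus α = 0.01, assuming numpy and scipy are available.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, trials = 30, 5000

for alpha in (0.05, 0.01):
    false_pos = 0   # Type I errors: null is true but we reject
    false_neg = 0   # Type II errors: a real difference exists but we fail to reject
    for _ in range(trials):
        # Null true: both groups drawn from the same distribution
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(0.0, 1.0, n)
        if ttest_ind(a, b).pvalue <= alpha:
            false_pos += 1
        # Null false: groups differ by half a standard deviation (d = 0.5)
        c = rng.normal(0.0, 1.0, n)
        d = rng.normal(0.5, 1.0, n)
        if ttest_ind(c, d).pvalue > alpha:
            false_neg += 1
    print(f"alpha = {alpha}: Type I rate ~ {false_pos / trials:.3f}, "
          f"Type II rate ~ {false_neg / trials:.3f}")
```

Lowering α from 0.05 to 0.01 drives the Type I rate down but, with the same samples, the Type II rate climbs, which is precisely the trade-off described above.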
There are some instances in which we would need a smaller value of alpha (an α level of 0.01) to reject a null hypothesis, where it is more acceptable to have a Type II error. In surgical removal of an organ for cancer, there must be a high degree of evidence against the null hypothesis so as to avoid removing an organ that is actually cancer free. This makes a Type II error more likely. As there is a delicate balance between Type I error and Type II error, setting the somewhat arbitrary alpha level is a judgment call, a choice between the lesser of two evils: a Type I error or a Type II error. Is it better to say there is an effect when there was none (Type I error), or is it better to say there is no effect when there was one (Type II error)?4

A Type II or β error is the failure to reject the null hypothesis when it is actually false, leading to failure to detect the difference between groups. The probability of detecting a difference, when one exists, by rejecting the null hypothesis is the statistical power of the significance test (1 − β). Statistical power is primarily determined by the size of the effect as well as the size of the sample; an increase in either the effect size or the sample size increases statistical power. Statistical power, sample size, effect size and the statistical significance level are closely associated: if the values of any three of these four factors are known, the fourth can be calculated. Power analysis determines the sample size needed in a study to detect a statistically significant difference with an appropriate effect size. Statistical significance is conventionally set at 0.05, and a statistical power of 80% (0.8) is generally accepted. The sample size can then be calculated from the effect size obtained from previous studies or a pilot study; the greater the effect size, the smaller the sample size required.

Significance tests are highly dependent on sample size. When the sample size is small, strong and important effects can be statistically insignificant due to the low statistical power of the significance test. When the sample size is large, even trivial effects can have impressive p values because the high statistical power increases the probability of detecting even a minute difference. In other words, a research outcome of statistical significance may be of little clinical significance whereas a statistically insignificant research outcome may be of clinical significance. An increase in sample size leading to an increase in statistical power may enable the statistical significance test to label an 'insignificant' difference as 'significant'. The sensitivity of statistical significance testing to sample size, and the abuse of the p value based on its rather arbitrary cut-off point (the alpha level) for determining significance, are the limitations of hypothesis testing. In clinical research there may be further limitations because patient samples are small or heterogeneous. While most researchers focus on statistical significance, the applicability of the p value or statistical significance is quite limited in clinical research due to its inability to provide all information about clinical significance.
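As a worked illustration of that four-way relationship, the sketch below uses the common normal-approximation formula for a two-sided, two-sample comparison of means, n per group ≈ 2·((z₁₋α/₂ + z₁₋β)/d)², to solve for the sample size given the other three quantities. The effect sizes are hypothetical and scipy is assumed to be available; this is only an approximation, not the paper's own calculation.

```python
import math
from scipy.stats import norm

def sample_size_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the chosen significance level
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

for d in (0.2, 0.5, 0.8):   # hypothetical small, medium and large standardized effects
    print(f"d = {d}: about {sample_size_per_group(d)} participants per group")
```

The output reproduces the familiar pattern: the larger the anticipated effect, the smaller the required sample.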
There is growing recognition of the limitations of hypothesis testing and p values as the sole criterion for interpreting results, because the p values from NHST reflect both the sample size and the magnitude of the effects studied. What is needed is an estimate of the magnitude of effect that is relatively independent of sample size.5,6,7 A published journal article with a statistically significant result is not necessarily important, as a statistically significant result does not mean that the difference between groups is big enough to be of clinical significance. Rather, it simply means that the difference is real rather than attributable to chance. Therefore, it is not enough to understand only what the p value is; it is also important to understand what the p value is not.
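The point that the p value reflects sample size while a standardized effect size does not can be seen in a short simulation. The sketch below is a hypothetical illustration (not from the paper), assuming numpy and scipy are available: a fixed, modest true difference of 0.2 SD is tested at increasing sample sizes.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

def cohens_d(a, b):
    """Standardized mean difference using the simple pooled SD."""
    sd_pooled = np.sqrt((np.var(a, ddof=1) + np.var(b, ddof=1)) / 2)
    return (np.mean(b) - np.mean(a)) / sd_pooled

for n in (20, 200, 2000):
    a = rng.normal(0.0, 1.0, n)   # control group
    b = rng.normal(0.2, 1.0, n)   # treatment group, true difference of 0.2 SD
    p = ttest_ind(a, b).pvalue
    print(f"n per group = {n:5d}: p = {p:.4f}, Cohen's d = {cohens_d(a, b):.2f}")
```

As n grows the p value shrinks toward 'significance', while the estimated effect size stays near the same modest 0.2, which is why the effect size, not the p value, speaks to clinical relevance.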
Evidence-based practice is supposed to impact clinical decision-making in clinical practice based on proper interpretation of clinical research and implementation of clinically significant (not merely statistically significant) research outcomes. Since the interpretations derived from traditionally performed statistical significance testing alone have the potential to be flawed, many researchers insist on reporting other clinically relevant measures for estimating clinical significance, such as measures of the magnitude of effects (i.e. effect size statistics) and their confidence intervals. The effect size tells us how strongly two or more variables are related or how large the difference between groups is. Therefore, clinicians and clinical researchers need to focus on clinical significance beyond statistical significance.8,9 Clinical significance helps the clinician understand whether or not a treatment is effective enough to apply in clinical practice. In other words, statistical significance answers the basic question “Does it work?” or “Does a difference exist?”, whereas clinical significance answers the more advanced question “How well does it work?” or “How big is the difference?”.

Robert Abelson10 emphasized that the goal of statistical analysis should be to make compelling claims, for which he presented the “MAGIC criteria”, a set of guidelines in his book “Statistics as Principled Argument”. MAGIC is an acronym for:

Magnitude - How big is the effect? Large effects are more compelling than small ones.

Articulation - How specific is it? Precise statements are more compelling than imprecise ones.

Generality - How generally does it apply? More general effects are more compelling than less general ones. Claims that would interest a more general audience are more compelling.

Interestingness - Interesting effects are those that "have the potential, through empirical analysis, to change what people believe about an important issue". More interesting effects are more compelling than less interesting ones. In addition, more surprising effects are more compelling than ones that merely confirm what is already known.

Credibility - Credible claims are more compelling than incredible ones. The researcher must show that the claims made are credible. Results that contradict previously established ones are less credible.

Estimation of clinical significance

In an attempt to estimate clinical significance, many researchers mistakenly equate statistical significance with clinical significance, assuming that the level of statistical significance is an index of effect size and thereby implying that the smaller the p value, the larger the treatment effect.8,11 This misunderstanding possibly arises because, when sample size is held constant, the p value correlates with effect size for some statistical significance tests, since the p value reflects both the sample size and the magnitude of the effects studied. However, that relationship completely breaks down as the sample size changes. The effect size is an important tool in reporting and interpreting clinical significance. The concept and estimation principles of effect sizes, along with confidence intervals, expand the approach to statistical thinking. Rather than reporting the difference in terms of measurement units, the effect size is a standardized, scale-free measure. The routine use of effect sizes, however, has generally been limited to meta-analysis (for combining and comparing estimates from
different studies with similar designs) and is all too rare in original research. Clinical researchers should include standardized effect sizes in their original research results as the standard method of quantitative review in clinical research, which allows them to view their results in the context of previous research. The effect size indicates the magnitude of the observed effect or the strength of the relationship between variables, whereas statistical significance (the p value) indicates the likelihood that the effect or relationship is due to chance. It is important to note that effect size estimation and statistical significance testing (also known as hypothesis testing) are complementary analyses, and both should be considered when evaluating quantitative research findings. What follows is a general guide to selecting the appropriate effect size measure for the data, followed by estimation and interpretation of effect size indices. The effect size metrics used for parametric tests are broadly grouped into two main categories: 1) those for comparing groups (the d family) and 2) those for determining the strength of associations between variables (the r family).

1) Comparing the Groups (d family):
The appropriate measure of effect size depends on the type of data collected and the type of statistical test used. When means of continuous variables between two groups are compared using a t test, the appropriate effect size measure is Cohen's d. Cohen's d is a measure of the standardized mean difference between two groups: the mean difference between the two groups is normalized to the pooled SD of the two groups.12

Cohen's d = (M2 − M1) / SDpooled, where SDpooled = √[(SD1² + SD2²) / 2]

In other words, it is the ratio of the mean difference between two groups to the standard deviation. This indicates that a d value of 0.5 can be interpreted as the group means differing by 0.5 SDs. This measure can be used only when the sample sizes of the two groups are the same and the SDs of the two groups are not significantly different.
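As a minimal illustration of the formula above, the following sketch computes Cohen's d for two equal-sized groups. The data are hypothetical and numpy is assumed to be available; it is a sketch of the calculation, not a general-purpose implementation.

```python
import numpy as np

def cohens_d(group1, group2):
    """Standardized mean difference between two equal-sized groups.

    Uses the simple pooled SD from the text: sqrt((SD1^2 + SD2^2) / 2).
    """
    m1, m2 = np.mean(group1), np.mean(group2)
    sd1, sd2 = np.std(group1, ddof=1), np.std(group2, ddof=1)
    sd_pooled = np.sqrt((sd1**2 + sd2**2) / 2)
    return (m2 - m1) / sd_pooled

# Hypothetical post-treatment measurements in control vs treatment groups
control = np.array([10.1, 11.3, 9.8, 10.7, 10.2, 11.0])
treatment = np.array([11.0, 12.1, 10.9, 11.8, 11.2, 12.0])
print(f"Cohen's d = {cohens_d(control, treatment):.2f}")
```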
If the sample sizes of the two groups are the same but the SDs of the populations belonging to the two groups are significantly different, then pooling the sample SDs is not appropriate. Glass's delta (another variation of Cohen's d) normalizes the difference between the two means to the SD of the control sample. This method assumes that the control group's SD is most similar to the population SD, because no treatment is applied.13 If the sample sizes of the two groups differ significantly, Hedges' g is a variation of Cohen's d that weights the pooled SD based on the sample sizes.14 When comparing means of three or more groups, say using an analysis of variance (ANOVA) test, Cohen's f is an appropriate effect size measure to report (Cohen, 1988). In this method, the sum of the deviations of the sample means from the combined sample mean is normalized to the combined sample SD. This test does not distinguish which means differ, but rather just determines whether all means are the same.

Other effect size measures commonly reported with ANOVA, multivariate analysis of covariance (MANCOVA), and analysis of covariance (ANCOVA) results are eta-squared and partial eta-squared. For one-way ANOVA they turn out to be the same, but in more complicated models their values differ. Eta-squared is calculated as the ratio of the between-groups sum of squares to the total sum of squares.15 Alternatively, partial eta-squared is calculated as the ratio of the between-groups sum of squares to the sum of the between-groups sum of squares and the error sum of squares.16 Partial eta-squared (η²) indicates the proportion of variance attributable to a factor.17

Cohen's f = √[η² / (1 − η²)]

Pearson's chi-square (χ²) test has been widely used for testing association between two categorical responses, for which the appropriate effect size measure is the odds ratio. A chi-square test assesses whether an association exists, while an odds ratio measures or quantifies the association between a risk factor/covariate and an outcome: chi-square is a test of association and the odds ratio is a measure of association.18 The odds ratio describes the likelihood of an outcome occurring in the treatment group compared with the likelihood of the outcome occurring in the control group. An odds ratio equal to 1 means that the odds of the outcome occurring are the same in the control and treatment groups. Likewise, an odds ratio of 2 indicates that the outcome is two times more likely to occur in the treatment group than in the control group. However, the odds ratio alone does not quantify the treatment effect, as the magnitude of the effect depends not only on the odds ratio but also on the underlying value of one of the odds in the ratio. For example, if a new treatment increases the odds of the outcome by 50% compared with the existing treatment, then the odds ratio is 1.5. However, if the control odds = 0.002 and the treatment odds = 0.003, the increase is most likely not practically meaningful. On the other hand, if the control odds = 0.2 and the treatment odds = 0.3, this could be interpreted as a substantial increase that one might find practically meaningful.

Table I. Interpreting effect size values (effect size measures classified as small, medium, large or very large effect sizes)
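To make the point about odds ratio magnitude concrete, the brief sketch below reuses the hypothetical numbers from the paragraph above: both scenarios give the same odds ratio of 1.5, yet the absolute change in odds differs enormously.

```python
def odds_ratio(treatment_odds: float, control_odds: float) -> float:
    """Odds ratio: odds of the outcome under treatment relative to control."""
    return treatment_odds / control_odds

# Scenario 1: rare outcome. OR = 1.5, but the odds change by only 0.001.
print(f"OR = {odds_ratio(0.003, 0.002):.2f}, absolute change in odds = {0.003 - 0.002:.3f}")

# Scenario 2: common outcome. Same OR = 1.5, but the odds change by 0.1.
print(f"OR = {odds_ratio(0.3, 0.2):.2f}, absolute change in odds = {0.3 - 0.2:.3f}")
```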
Table II. Comparison of thresholds for d and r
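Table II compares conventional thresholds for d and r. One common conversion between the two, assuming two groups of equal size, is r = d / √(d² + 4). The short sketch below is a hypothetical illustration, not part of the original table; it reproduces the corrected r values quoted in the next paragraph.

```python
import math

def d_to_r(d: float) -> float:
    """Convert Cohen's d to the point-biserial correlation r,
    assuming two groups of equal size."""
    return d / math.sqrt(d**2 + 4)

for d in (0.2, 0.5, 0.8):   # conventional small, medium and large d
    print(f"d = {d:.1f}  ->  r = {d_to_r(d):.2f}")
# prints roughly: d = 0.2 -> r = 0.10, d = 0.5 -> r = 0.24, d = 0.8 -> r = 0.37
```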
Although a small d = 0.20 is equivalent to a small r = 0.10, the r equivalents of a medium-sized d = 0.5 and a large-sized d = 0.8 are overestimated and need to be corrected to r = 0.24 and r = 0.37 respectively.

Conclusion

There are limitations associated with using p values as the sole criterion for interpreting research results in order to apply them to clinical practice. Therefore, clinicians and clinical researchers need to focus on clinical significance beyond statistical significance. Statistical significance and effect size indices are complementary analyses: statistical significance indicates the likelihood that the observed effect is real rather than due to chance, whereas the effect size index indicates the magnitude of the observed effect or the strength of the relationship between the variables. Although the routine use of effect sizes has usually been limited to meta-analysis for combining and comparing estimates from different studies of similar designs, effect sizes should also be reported in primary studies along with statistical significance for interpretation of the research outcome.
REFERENCES

1. Kirk RE (1996). Practical significance: a concept whose time has come. Educ Psychol Meas 56, 746–759.
2. Gigerenzer G, Swijtink Z, Porter T, Daston L, Beatty J, Kruger L (1989). The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge: Cambridge University Press.
3. Palesch YY (2014). Some common misperceptions about p-values. Stroke 45(12), e244–e246.
4. Banerjee A, Chitnis UB, Jadhav SL, Bhawalkar JS, Chaudhury S (2009). Hypothesis testing, type I and type II errors. Ind Psychiatry J 18(2), 127–131.
5. Vaske JJ (2002). Communicating judgments about practical significance: effect size, confidence intervals and odds ratios. Human Dimens Wildl 7, 287–300.
6. Nakagawa S, Cuthill IC (2007). Effect size, confidence interval and statistical significance: a practical guide for biologists. Biol Rev Camb Philos Soc 82, 591–605.
7. Ferguson CJ (2009). An effect size primer: a guide for clinicians and researchers. Prof Psychol Res Pract 40, 532–538.
8. Thompson B (1993). The use of statistical significance tests in research: bootstrap and other alternatives. J Exp Educ 61, 361–377.
9. Fan X (2001). Statistical significance and effect size in education research: two sides of a coin. J Educ Res 94, 275–282.
10. Abelson RP (1995). Statistics as Principled Argument. Hillsdale, NJ: Erlbaum.
11. Nickerson RS (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods 5, 241.
12. Cohen J (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum.
13. Glass GV, McGaw B, Smith M (1981). Meta-Analysis in Social Research. Beverly Hills, CA: Sage.
14. Hedges LV (1981). Distribution theory for Glass's estimator of effect size and related estimators. J Educ Stat 6, 106–128.
15. Kerlinger FH (1964). Foundations of Behavioral Research. New York: Holt, Rinehart and Winston.
16. Cohen J (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educ Psychol Meas 33, 107–112.
17. Levine TR, Hullett CR (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Hum Commun Res 28, 612–625.
18. Thompson B (1996). AERA editorial policies regarding statistical significance testing: three suggested reforms. Educ Res 25, 26–30.
19. Maher JM, Markey JC, Ebert-May D (2013). The other half of the story: effect size analysis in quantitative research. CBE Life Sci Educ 12, 345–351.
20. Rosenthal R, Rubin DB (2003). r-equivalent: a simple effect size indicator. Psychol Methods 8, 492–496.
21. Cohen J (1992). A power primer. Psychol Bull 112, 155–159.
22. Rosenthal JA (1996). Qualitative descriptors of strength of association and effect size. J Social Serv Res 21, 37–59.
23. Rosenthal R, Rosnow RL (1984). Applying Hamlet's question to the ethical conduct of research: a conceptual addendum. American Psychologist 39, 561–563.