Original Article
Rethinking Observed Power Concept, Practice, and Implications Shuyan Sun, Wei Pan, and Lihshing Leigh Wang School of Education, University of Cincinnati, OH, USA Abstract. Observed power analysis is recommended by many scholarly journal editors and reviewers, especially for studies with statistically nonsignificant test results. However, researchers may not fully realize that blind observance of this recommendation could lead to an unfruitful effort, despite the repeated warnings from methodologists. Through both a review of 14 published empirical studies and a Monte Carlo simulation study, the present study demonstrates that observed power is usually not as informative or helpful as we think because (a) observed power for a nonsignificant test is generally low and, therefore, does not provide additional information to the test; and (b) a low observed power does not always indicate that the test is underpowered. Implications and suggestions of statistical power analysis for quantitative researchers are discussed. Keywords: power analysis, observed power, significance test, effect size
"Take seriously the statistical power considerations associated with your tests of hypotheses" (American Psychological Association [APA], 2001, p. 24). This is what the fifth edition of the APA Publication Manual suggests to quantitative researchers. The sixth edition of the APA Publication Manual reemphasized the importance of power analysis when applying inferential statistics (APA, 2010). The problem, however, is how to consider statistical power. Observed power analysis is recommended by scholarly journal editors and reviewers, especially for studies with nonsignificant test results (Hoenig & Heisey, 2001; Levine & Ensom, 2001; Yuan & Maxwell, 2005). It is a good sign that more and more researchers will utilize power analysis in quantitative research. However, the usefulness of observed power analysis may not have been fully investigated. The aims of the present study were to (a) briefly review the concept of statistical power and its implementation in the literature; (b) demonstrate that observed power analysis could lead to an unwarranted endeavor; and (c) provide suggestions for better practice of statistical power analysis.
Statistical Power Analysis in Theory

In Cohen's (1988) seminal book, the power of a statistical test was defined as "the probability [assuming the null hypothesis is false] that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists" (p. 4). In other words, power is the conditional probability of making a correct decision to reject H0 when it is actually false, or

Power = P(Reject H0 | H0 is false).

Two types of decision errors may occur in statistical testing: Type I error (α) and Type II error (β). Type I error is the
conditional probability of rejecting H0 when it is actually true, and Type II error is the conditional probability of failing to reject H0 when it is actually false. They can be expressed as α = P(Reject H0 | H0 is true) and β = P(Not reject H0 | H0 is false). Power and β are complementary; thus power = 1 − β. After the hypotheses are stated and the α level is set, the power of the test can be estimated. The general form of power estimation for a two-tailed test is Power = P(|test statistic| > critical value | δ), where δ is the noncentrality measure, expressing how far the true value of the parameter is from the parameter estimate in terms of standard error (the standard deviation of the estimator) or estimated standard error; all of the factors in the equation depend on the sampling distribution of the estimator. Power is influenced by the choice of α, the direction of the hypotheses (one-tailed or two-tailed), the sample size, and the population effect size. Because of the use of a fixed α level and the popularity of two-tailed tests in the social sciences, researchers are more concerned about the last two factors. Low power may result from a small sample size, a small population effect size, or both. Researchers have to make sure their tests have at least acceptable power for the sake of the validity of the study; thus power analysis plays an important role in quantitative research.
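For instance, for the two-sided, two-sample t test that will be used later in this article, the general form specializes to a standard noncentral t expression (stated here for concreteness; it is a textbook result rather than a formula reproduced from the original article):

Power = [1 − F(t_crit; ν, δ)] + F(−t_crit; ν, δ), where t_crit is the two-sided critical value t(1 − α/2, ν), ν = 2n − 2 is the degrees of freedom, δ = d√(n/2) is the noncentrality parameter, F(·; ν, δ) is the cumulative distribution function of the noncentral t distribution, d is the population Cohen's d, and n is the per-group sample size.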
Implementation of Statistical Power Analysis

As briefly reviewed in the previous section, statistical power is nothing but a probability value estimated from the specified hypotheses, the designated α level, the sample size, and the population effect size before the data are collected. Power analysis conducted before the test is known in the literature as prospective power or prior power (Kline, 2004; O'Keefe, 2007; Onwuegbuzie & Leech,
2004). It involves an estimation based on the level of α, the sample size, and the expected population effect size derived from theory or previous research. Usually the expected effect size is the minimum value considered to be substantively or practically significant according to the literature in the field. If researchers are uncertain about the population effect size, power can be estimated for a range of possible population effect sizes, yielding a series of power curves in which the influence of effect size on statistical power can be clearly indicated. The relationship between power, α, sample size, and effect size is such that "when any three of them are fixed, the fourth is completely determined" (Cohen, 1988, p. 14). Also, the power function is a strictly increasing function of sample size and effect size when α is fixed: the larger the sample size, the higher the power; and the larger the effect size, the higher the power. Thus, prospective power analysis is usually done as sample size estimation in the research planning phase, to estimate the minimum sample size needed to obtain the desired level of power given the effect size of interest and the α level. The first and foremost consequence of failing to consider statistical power in the research planning phase is an inflated Type II error because of low statistical power (Cohen, 1988).
Power analysis is relevant to tests that have not yet been performed (Gerard, Smith, & Weerakkody, 1998); thus, theoretically it should be done before the test. However, power analysis can also be done in a retrospective way, that is, retrospective power analysis. It estimates the overall statistical power of research fields or journals (Cohen, 1962; Sedlmeier & Gigerenzer, 1989). Simply put, it estimates how many tests are statistically significant out of the total number of tests within a journal or a field. It is mainly used for meta-analytical reviews of statistical power. For a given predicted or hypothetical population effect size, fixed sample size, and α, the mathematical forms of prospective power and retrospective power are the same, although they are estimated at different times. Therefore, one form of retrospective power analysis is to determine the power of an experiment to detect a predicted effect size for a specified sample size and α (Cohen, 1992). For instance, Cohen estimated the average power to detect hypothetical effect sizes across 70 studies and found that the average power for detecting small, medium, and large effect sizes was only 0.18, 0.48, and 0.83, respectively (Cohen, 1962).
In the literature there is another type of power analysis known as observed power, also called achieved power or post hoc power. Observed power is estimated from the observed effect size, which is assumed to be exactly equal to the population effect size (Hoenig & Heisey, 2001; O'Keefe, 2007); the probability of rejecting the null hypothesis is then estimated given the sample size and α level. It is recommended by scholarly journal editors and reviewers, especially for nonsignificant test results (Hoenig & Heisey, 2001; Levine & Ensom, 2001; Yuan & Maxwell, 2005). Hoenig and Heisey (2001) cited 19 articles published in 1983–1997 that recommended the use of observed power to interpret nonsignificant results. Most recently, Onwuegbuzie and Leech have advocated observed power for nonsignificant results. They "believe that post hoc power analyses should play a central role in null hypothesis significance testing" (Onwuegbuzie & Leech, 2004, p. 225). It is worth noting that there are no universally defined or accepted labels or terms for the various power analyses in the literature. Some researchers might have been using the two terms observed power and retrospective power interchangeably. They could be the same or different, depending on what information is used to estimate the power. O'Keefe (2007) proposed that labels such as "post hoc," "observed," "retrospective," and "achieved" power should be avoided because they are potentially confusing and misleading. In order to facilitate the following discussion in this paper, observed power analysis refers to estimating power from an observed effect size from a single study, whereas retrospective power analysis refers to the proportion of significant results after the completion of multiple tests (Cohen, 1962).
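As a concrete illustration of prospective power analysis as sample size estimation, the SAS sketch below uses PROC POWER (the same procedure the authors later use for their prospective power estimates) to solve for the per-group sample size needed to detect a hypothesized standardized mean difference with a two-sided, two-sample t test. The specific input values are illustrative assumptions, not figures taken from any particular study.

```sas
/* Prospective power analysis: solve for the per-group sample size needed */
/* to detect a standardized mean difference of 0.5 with power .80 at      */
/* alpha = .05 (two-sided, two-sample t test). Input values are examples. */
proc power;
   twosamplemeans test=diff
      meandiff  = 0.5      /* hypothesized minimum meaningful effect */
      stddev    = 1        /* common standard deviation              */
      alpha     = 0.05
      power     = 0.80     /* desired power                          */
      npergroup = .;       /* the unknown to be solved for           */
run;
```

With these inputs the procedure returns a required sample size of about 64 per group, which is the design used in the simulation reported below.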
Observed Power Analysis

As early as 1997, Steidl, Hayes, and Schauber noticed that "when a null hypothesis is not rejected, it has become an increasingly common practice to inquire about the power of the statistical test" (Steidl, Hayes, & Schauber, 1997, p. 274). This practice is an additional effort to distinguish between two possibilities: not rejecting the null hypothesis because there is no effect (a correct statistical decision) versus failing to reject an incorrect null hypothesis (a Type II error). Can they be distinguished by observed power analysis? Advocates of observed power analysis believe that it can "help to rule in/out rival explanations in the presence of statistically nonsignificant findings" (Onwuegbuzie & Leech, 2004, p. 201). However, this advocacy faces several challenges that question the possible misuse and misinterpretation of observed power.
The greatest challenge to the argument that observed power is appropriate for addressing nonsignificance is that the observed effect size is a random variable, a function of sample statistics. In a single test, the observed effect size can be very different from the underlying population effect size due to sampling error and, therefore, observed effect sizes usually vary across studies. In other words, an observed effect in each test is just "one of many reasonable possibilities for the true population parameter" (Froman & Shneyderman, 2004, p. 372). Therefore, the observed power based on a single test cannot estimate the true chance of detecting a true population effect. Advocates of observed power, however, assume that the observed effect size is exactly equal to the true population effect size, an assumption too strong to be true.
The second challenge concerns the nature of statistical power. As a probability, power is relevant to significance tests that have not yet been done, but not to tests that have already been completed (Gerard et al., 1998). Once the study is done, the result, statistically significant or not, is either a correct decision or an error. It is impossible to decide whether it is an error or not because the population effect is generally unknown. Even if there is a significant population effect, it is still possible that the test turns out to be nonsignificant. Power can be
manipulated to be 1 and Type II error to be 0 by making the sample size large enough, but the consequence is an inflated Type I error. Therefore, power analysis should be conducted in the research planning phase. As Kline pointed out, "a post hoc analysis that shows lower power is more like an autopsy than a diagnostic procedure. That is, it is better to think about power before it is too late" (Kline, 2004, p. 43).
The third challenge is the usefulness of observed power in interpreting significance test results. Both the p value and observed power depend on the observed effect size and so are inversely related, such that a test with a high p value tends to have low observed power and vice versa; therefore, observed power restates what a p value already states (Hoenig & Heisey, 2001; Thomas, 1997). Observed power generally underestimates the true power (Gillett, 1994); for nonsignificant results, observed power will usually be < 0.5 after adjustment for bias and has no direct correlation with true power (Goodman & Berlin, 1994; Hayes & Steidl, 1997; Reed & Blaustein, 1995; Steidl et al., 1997). Yuan and Maxwell (2005) mathematically proved that observed power can be positively or negatively biased relative to true power. They also tried different bias-correction methods, but no better estimator of true power could be identified. Therefore, observed power is not as helpful as we might hope in interpreting significance test results.
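To make the circularity concrete, observed power for a two-sample t test is simply a deterministic transformation of the observed effect size, and hence of the t statistic and the p value. A minimal SAS sketch, assuming the noncentral t formulation given earlier and an illustrative observed Cohen's d of 0.25 with n = 64 per group (both values are assumptions for demonstration only), is:

```sas
/* Observed power for a two-sided, two-sample t test, obtained by plugging */
/* the observed effect size in as if it were the population effect size.   */
data observed_power;
   n     = 64;                            /* per-group sample size (assumed) */
   alpha = 0.05;
   d_obs = 0.25;                          /* observed Cohen's d (assumed)    */
   df    = 2*n - 2;                       /* degrees of freedom              */
   nc    = d_obs * sqrt(n/2);             /* noncentrality parameter         */
   tcrit = tinv(1 - alpha/2, df);         /* two-sided critical value        */
   obs_power = 1 - probt(tcrit, df, nc) + probt(-tcrit, df, nc);
   p_value   = 2 * (1 - probt(nc, df));   /* p value implied by the same d   */
   put obs_power= p_value=;
run;
```

Because both quantities are computed from the same observed effect size, the observed power conveys nothing beyond what the p value already conveys; for example, a result with p exactly at .05 has an observed power of roughly .50.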
Review of Observed Power Analysis in Published Studies

In order to empirically investigate the actual practice in reporting observed power, we conducted a small-scale literature search in the PsycINFO database using the keywords "observed power," "post hoc power," or "achieved power." Note that the purpose of this search was limited to finding several studies that actually estimated and reported observed power. It was not intended to be a systematic or comprehensive review; thus, the general procedures of a comprehensive review were not strictly followed. Among the 14 studies identified by the search (Brennan, Hellerstedt, Ross, & Welles, 2007; Chen & Bradshaw, 2007; Chyung, 2007; Eigsti, Bennetto, & Dadlani, 2007; Heath, Toste, Nedecheva, & Charlebois, 2008; Heslin, Vandewall, & Latham, 2006; Kahan & Mathis, 2007; Mamede, Schmidt, Rikers, Penaforte, & Coelho-Filho, 2007; Mazefsky & Oswald, 2007; Reynolds, Baker, & Pedersen, 2000; Rezvan, Ahmadi, & Abedi, 2006; Schorr, 2006; Southam-Gerow, Silverman, & Kendall, 2006; Spatariu, Hartley, Schraw, Bendixen, & Quinn, 2007), the observed power ranged from 0.52 to 1.0 for significant results; for nonsignificant results, the observed power was much lower, ranging from 0 to 0.65.
Some interpretations of observed power found in the 14 studies may not be appropriate. First, low observed power was blamed for the nonsignificant results (e.g., Brennan, Hellerstedt, Ross, & Welles, 2007; Eigsti, Bennetto, & Dadlani, 2007; Mazefsky & Oswald, 2007). As mentioned earlier, a nonsignificant result may be a correct statistical decision or a Type II error. However, a low observed power cannot tell which is the true state and, therefore, does
not provide additional information to explain why significance could not be achieved. Second, low observed power was interpreted as the consequence of a small sample size, and therefore a larger sample was suggested for future research (e.g., Heath, Toste, Nedecheva, & Charlebois, 2008; Reynolds, Baker, & Pedersen, 2000; Schorr, 2006). Achieved significance suggests that the sample size is adequate for the test; however, Chyung's (2007) study clearly showed that the observed power for a significant result can be as low as 0.52, almost like tossing a fair coin. Therefore, low observed power does not mean that the sample size is inadequate. Also, the observed effect size in a larger sample may be very different; therefore, significance or high observed power may not always be guaranteed by obtaining a larger sample size in future research. Third, high observed power was interpreted as evidence for a population effect (e.g., Heslin, Vandewall, & Latham, 2006; Rezvan, Ahmadi, & Abedi, 2006); relatively low observed power was interpreted as evidence for weak practical implications (Chyung, 2007). In fact, a power value only indicates the chance of detecting a true population effect if it really exists; it says nothing about the existence or nonexistence of the population effect. Fourth, if the sample size had adequate power to detect a hypothetically meaningful population effect, a nonsignificant test with low observed power was interpreted as evidence for a very small population effect (e.g., Southam-Gerow, Silverman, & Kendall, 2006). In fact, an adequate sample size for a hypothetical population effect only guarantees that the test will have adequate power to detect the hypothetical effect; it cannot be used as evidence of the nonexistence of a population effect if the result is not significant.
Purpose of the Study

The small-scale review of published research suggested that observed power for nonsignificant results is usually low and that the interpretation of observed power may not be appropriate. To theoretically verify our conclusion from the review, a Monte Carlo simulation was conducted to show that (a) for nonsignificant tests, the observed power estimated from observed effect sizes is usually 0.5 or lower; (b) nonsignificant tests usually have low observed effect sizes, no matter whether the sample size is adequate to detect a meaningful difference or not; and (c) retrospective power is approximately equal to the prospective power of the test based on a large number of replications.
Method

SAS RANNOR (SAS Institute, 2003) was used to generate random numbers to compare the means of two distributions: Y0 ~ N(0, 1) versus Y1 ~ N(μ, 1), where μ = 0.2, 0.5, and 0.8, respectively. In other words, the population effect size (Cohen's d) was manipulated to be 0.2, 0.5,
and 0.8, respectively. The sample size was fixed at n = 64 per group so that the power of the test is about 0.8 when the population effect size is 0.5, which is the cutoff value for a medium effect based on Cohen's benchmarks (Cohen, 1988). Ten thousand independent-sample t-tests (two-tailed) at α = .05 were conducted in each of the three scenarios using SAS PROC TTEST, and observed power was estimated using SAS data steps. The number of replications met Cochrane's criteria for the robustness of simulation studies (Serlin, 2000).
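The article does not reproduce the simulation program itself; the sketch below is one way the design described above could be implemented, with an arbitrary seed and data set names chosen for illustration. It generates the three scenarios, runs the two-sided t test on each replication, and collects the results with ODS OUTPUT.

```sas
/* Generate 10,000 two-group samples (n = 64 per group) for each of the   */
/* three population effect sizes, then run a two-sided t test on each.    */
data sim;
   do d = 0.2, 0.5, 0.8;                  /* population Cohen's d          */
      do rep = 1 to 10000;                /* replications                  */
         do i = 1 to 64;                  /* per-group sample size         */
            group = 0; y = 0 + rannor(20110731); output;   /* reference    */
            group = 1; y = d + rannor(20110731); output;   /* shifted mean */
         end;
      end;
   end;
   drop i;
run;

ods output ttests=tt;                     /* collect t, df, and p values   */
proc ttest data=sim alpha=0.05;
   by d rep;
   class group;
   var y;
run;
```

Printing the full listing output for 30,000 t tests is impractical; in practice the printed output would be suppressed or routed elsewhere, and only the ODS OUTPUT data set tt would be used for further processing.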
Results and Discussion

Observed Power and p Value

As shown in Figure 1, observed power is a decreasing function of the p value. The population effect size was simulated to be 0.2, 0.5, and 0.8, respectively. No matter how large the population effect is, the observed power for nonsignificant tests (i.e., p > .05) is always about 0.5 or less. Table 1 further shows that the observed power of nonsignificant tests varies from 0.0500 to 0.5015. These results empirically verified the previous arguments that observed power of nonsignificant tests will always be less than about 0.5 (Goodman & Berlin, 1994; Hayes & Steidl, 1997; Reed & Blaustein, 1995; Steidl et al., 1997). When the true population effect size is 0.5 or 0.8, the independent t-tests with a group size of 64 are adequately powered to detect such a population effect; however, as shown in Table 1, the observed power for nonsignificant tests is about 0.5 or lower. Therefore, a low observed power does not always imply that the sample size is too small to detect the true population effect.

Observed Effect Size and Observed Power

The observed effect sizes for nonsignificant and significant test results are summarized in Table 2. Clearly, effect sizes for nonsignificant tests are usually very low, 0.35 or lower. Figure 2 shows the relationship between observed effect size and observed power. Clearly, observed power is just an increasing function of observed effect size: as the observed effect size increases, the observed power of the test increases. This relationship holds for all three population effect sizes. Recall that in the present simulation the selected sample size was adequate for a population effect size of 0.5. Therefore, low observed power cannot be interpreted as the consequence of a small sample size, because the sample size is not small for a population effect size of 0.5.
Figure 1. Relationship between observed power and p value.
Table 1. Observed powerᵃ for significant and nonsignificant tests

                                   Population effect size (prospective power)ᵇ
Test results                 0.2 (0.2022)       0.5 (0.8015)       0.8 (0.9943)
Nonsignificant    M          0.1896             0.3190             0.3671
                  SD         0.1315             0.1271             0.0829
                  Range      0.0500–0.5011      0.0500–0.5015      0.1635–0.4938
Significant       M          0.6981             0.8303             0.9657
                  SD         0.1314             0.1428             0.0689
                  Range      0.5017–0.9997      0.5016–1.0000      0.5020–1.0000

Note. ᵃ The observed power is estimated from the observed effect size, sample size n = 64, and α = .05. ᵇ Numbers in parentheses are prospective power estimated by SAS PROC POWER.
Table 2. Observed effect sizes for significant and nonsignificant tests

                                   Population effect size
Test results                 0.2                0.5                0.8
Nonsignificant    M          0.1649             0.2556             0.2868
                  SD         0.0988             0.0773             0.0410
                  Range      0–0.3497           0.0008–0.3498      0.1733–0.3463
Significant       M          0.4537             0.5663             0.8060
                  SD         0.0867             0.1411             0.1814
                  Range      0.3499–0.9561      0.3498–1.2940      0.3500–1.5721
Figure 2. Relationship between observed power and observed effect size.
Prospective Power and Retrospective Power

Prospective power was estimated using SAS PROC POWER. Based on Cohen's original definition of retrospective power (Cohen, 1962), retrospective power is defined here as the proportion of significant tests among the 10,000 replications. Table 3 shows that the retrospective power is approximately equal to the prospective power. The present study used only 10,000 replications for each population effect size, which explains the slight discrepancy between prospective power and retrospective power; with a large enough number of replications, retrospective power should converge to prospective power. This finding implies that as long as an adequate sample size is selected to detect a meaningful and important hypothetical population effect size through prospective power analysis, the test will have good retrospective power over multiple replications, and researchers can eliminate the possibility that a nonsignificant result is due to an inadequate sample size. In other words, an estimate of adequate prospective statistical power before the test, in combination with an interpretation of estimated effect sizes after the test, is a good way to explain statistically nonsignificant test results. In reality, a sample of adequate size may not be obtainable, but it is still necessary to do the prospective power analysis and estimate the overall power of the test with the available sample size.

Table 3. Prospective powerᵃ and retrospective powerᵇ from 10,000 replications

                              Population effect size
                              0.2          0.5          0.8
Prospective power             0.2022       0.8015       0.9943
Retrospective power           0.1967       0.8026       0.9954

Note. ᵃ Estimated from the population effect size, sample size n = 64, and α = .05. ᵇ The proportion of significant tests in 10,000 replications.
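Under these definitions, retrospective power can be computed directly from the simulated t tests as the proportion of significant replications and compared with the prospective values from PROC POWER. The sketch below continues from the hypothetical simulation code given in the Method section (the data set and variable names tt, method, and probt follow the ODS OUTPUT conventions of PROC TTEST; the overall program is an assumed implementation, not the authors' actual code).

```sas
/* Keep the pooled (equal-variance) t test for each replication and flag  */
/* whether it was significant at alpha = .05.                             */
data flags;
   set tt;                             /* ODS OUTPUT data set from PROC TTEST */
   where method = 'Pooled';
   sig = (probt < 0.05);               /* 1 = significant, 0 = not            */
run;

/* Retrospective power = proportion of significant tests per effect size  */
proc means data=flags mean n;
   class d;
   var sig;
run;

/* Prospective power for the same design, for comparison with Table 3     */
proc power;
   twosamplemeans test=diff
      meandiff  = 0.2 0.5 0.8
      stddev    = 1
      alpha     = 0.05
      npergroup = 64
      power     = .;
run;
```

With 10,000 replications, the mean of sig should reproduce the retrospective power values in Table 3 (roughly .20, .80, and .99) up to Monte Carlo error, while PROC POWER reproduces the prospective values.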
Conclusions and Implications

The review of the 14 published empirical studies suggested that the observed power estimates may not have been appropriately interpreted. Both the 14 published studies and the simulation results showed that the observed power based on an observed effect is always low for a nonsignificant test result; without looking at the obtained effect size, it may not be appropriate to conclude that the sample size is inadequate. The logic of recommending observed power analysis for nonsignificant tests is to seek further evidence to verify
whether the statistical decision is correct or an error. The present paper showed that, relying on observed power estimates alone, one may not be able to make such a decision. A lot of good work has been done to explore less biased observed power estimators (Gerard et al., 1998; Johnson, Kotz, & Balakrishnan, 1995; Yuan & Maxwell, 2005), but no satisfactory estimator has been found. Further efforts from research methodologists in this direction may still be necessary. Quantitative researchers' true interest is to estimate the population parameters and population effect sizes, and they can use prospective power analysis to select an adequate sample size to detect a practically meaningful effect size in a single study. True population power can be estimated using the approach of Cohen's retrospective power (Cohen, 1992), which is power estimation from a fixed sample size, a preset significance level α, and a predicted population effect size. It can provide some indirect evidence for the existence of a true population parameter and effect size.
The present study has several important implications. First, conducting power analysis to estimate an adequate sample size is very critical and helpful in the research planning phase. Cohen's work (1988) can be used as a guideline. If the test is already done, the hypothetical population effect, rather than the observed sample effect, should be used to estimate the power of the test (O'Keefe, 2007). Second, when significance cannot be achieved, authors should focus on multiple sources, such as the research design, the instruments, and comparisons of effect sizes across studies, to explore the possible reasons for the nonsignificance. Reporting confidence intervals and observed effect sizes is generally recommended for interpreting nonsignificant results instead of observed power (Colegrave & Ruxton, 2003; Cumming & Finch, 2001; Goodman & Berlin, 1994; Hoenig & Heisey, 2001; Smith & Bates, 1992; Zumbo & Hubley, 1998).
In social science research, replication has not been given as much attention as in the natural sciences (Henson, 2006; Kline, 2004). The nature of social and behavioral science research makes it less likely to be replicable than natural science research (Hedges, 1987). Moreover, editorial preference for novelty, uniqueness, and original work may discourage replications (Kline, 2004). Although exact or literal replication, in which all of the major aspects of the original study are closely reproduced, is almost impossible in practice, other types of external replication, such as operational replication that only copies the sampling and design of the original studies, are feasible. When researchers use inferential statistics to analyze their data, their intention is to generalize the result beyond the sample in their study and make inferences about the corresponding population. This makes possible the construct replication that looks for "the construct validity of the original finding – that is, whether it has generality beyond situation studied in the original work" (Kline, 2004, p. 241). The basic ideas of sampling distribution, Type I error, Type II error, and hypothesis testing are built on replication over repeated samples. The result from a single study really tells researchers very limited and fallible information about the study, but the truth will emerge over repeated samples. Therefore, it is time for social and behavioral scientists to give replication its due. Related studies in the social and
behavioral sciences can be considered as construct replications; thus, explaining the heterogeneity of the results becomes very important for the sake of identifying generalizability. Meta-analysis as a quantitative research synthesis tool can serve this purpose. Whenever a new research study is being planned, researchers should use the results from previous meta-analytical studies, if available, to guide their research design. Whether a previous meta-analysis is available or not, new research has to be done with researchers' meta-analytical thinking. Meta-analytical thinking does not overemphasize the outcomes of statistical tests in individual studies (Kline, 2004). Instead, it emphasizes the need to explicitly design and place the studies in the context of the effects of prior research and to compare the new results to the previous findings. Researchers do not have to conduct a formal meta-analysis, but they should develop the capacity to think meta-analytically about their own research (Thompson, 2002).
References

American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: American Psychological Association.
American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: American Psychological Association.
Brennan, D. J., Hellerstedt, W. L., Ross, M. W., & Welles, S. L. (2007). History of childhood sexual abuse and HIV risk behaviors in homosexual and bisexual men. American Journal of Public Health, 97, 1107–1112.
Chen, C., & Bradshaw, A. C. (2007). The effect of web-based question prompts on scaffolding knowledge integration and ill-structured problem solving. Journal of Research on Technology in Education, 39, 359–375.
Chyung, S. Y. (2007). Age and gender differences in online behavior, self-efficacy, and academic performance. The Quarterly Review of Distance Education, 8, 213–222.
Cohen, J. (1962). The statistical power of abnormal social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). Statistical power analysis. Current Directions in Psychological Science, 1, 98–101.
Colegrave, N., & Ruxton, G. D. (2003). Confidence intervals are a more useful complement to nonsignificant tests than are power calculations. Behavioral Ecology, 14, 446–447.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 1, 532–574.
Eigsti, I., Bennetto, L., & Dadlani, M. B. (2007). Beyond pragmatics: Morphosyntactic development in autism. Journal of Autism and Development Disorder, 37, 1007–1023.
Froman, T., & Shneyderman, A. (2004). Replicability reconsidered: An excessive range of possibilities. Understanding Statistics, 3, 365–373.
Gerard, P. D., Smith, D. R., & Weerakkody, G. (1998). Limits of retrospective power analysis. The Journal of Wildlife Management, 62, 801–807.
Gillett, R. (1994). Post hoc power analysis. Journal of Applied Psychology, 79, 783–785.
Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence intervals when planning experiments and the
misuse of power when interpreting results. Annals of Internal Medicine, 121, 200–206.
Hayes, J. P., & Steidl, R. J. (1997). Statistical power analysis and amphibian population trend. Conservation Biology, 11, 273–275.
Heath, N. L., Toste, J. R., Nedecheva, T., & Charlebois, A. (2008). An examination of nonsuicidal self-injury among college students. Journal of Mental Health Counseling, 30, 137–156.
Hedges, L. V. (1987). How hard is hard science, how soft is soft science? American Psychologist, 42, 443–455.
Henson, R. K. (2006). Effect size measures and meta-analytic thinking in counseling psychology research. The Counseling Psychologist, 34, 601–629.
Heslin, P. A., Vandewall, D., & Latham, G. P. (2006). Keen to help? Managers' implicit person theories and their subsequent employee coaching. Personnel Psychology, 59, 871–902.
Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55, 19–24.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1995). Continuous univariate distributions (2nd ed.). New York, NY: Wiley.
Kahan, T. A., & Mathis, K. M. (2007). Searching under cups for clues about memory – An online demonstration. Teaching of Psychology, 34, 124–128.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Levine, M., & Ensom, M. H. H. (2001). Post hoc power analysis: An idea whose time has passed? Pharmacotherapy, 21, 405–409.
Mamede, S., Schmidt, H., Rikers, R., Penaforte, J. C., & Coelho-Filho, J. M. (2007). Breaking down automaticity: Case ambiguity and the shift to reflective approaches in clinical reasoning. Medical Education, 41, 1185–1192.
Mazefsky, C. A., & Oswald, D. P. (2007). Emotion perception in Asperger's syndrome and high-functioning autism: The importance of diagnostic criteria and cue intensity. Journal of Autism and Development Disorder, 37, 1086–1095.
O'Keefe, K. J. (2007). Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analyses. Communication Methods and Measures, 1, 291–299.
Onwuegbuzie, A. J., & Leech, N. L. (2004). Post hoc power: A concept whose time has come. Understanding Statistics, 3, 201–230.
Reed, J. M., & Blaustein, A. R. (1995). Assessment of nondeclining amphibian population using power analysis. Conservation Biology, 9, 1299–1300.
Reynolds, C. A., Baker, L. A., & Pedersen, N. L. (2000). Multivariate models of mixed assortment: Phenotypic assortment and social homogamy for education and fluid ability. Behavior Genetics, 30, 455–76.
Rezvan, S., Ahmadi, S. A., & Abedi, M. R. (2006). The effects of metacognitive training on the academic achievement and happiness of Esfahan University conditional students. Counselling Psychology Quarterly, 19, 415–428.
Schorr, E. A. (2006). Early cochlear implant experience and emotional functioning during childhood: Loneliness in middle and late childhood. The Volta Review, 106, 365–379.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.
Serlin, R. C. (2000). Testing for robustness in Monte Carlo studies. Psychological Methods, 5, 230–240.
Smith, A. H., & Bates, M. N. (1992). Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology, 3, 449–452.
Southam-Gerow, M. A., Silverman, W. K., & Kendall, P. C. (2006). Client similarities and differences in two childhood anxiety disorders research clinics. Journal of Clinical Child and Adolescent Psychology, 35, 528–538.
Spatariu, A., Hartley, K., Schraw, G., Bendixen, L. D., & Quinn, L. F. (2007). The influence of the discussion leader procedure on the quality of arguments in online discussion. Journal of Educational Computing Research, 37, 83–103.
Steidl, R. J., Hayes, J. P., & Schauber, E. (1997). Statistical power analysis in wildlife research. The Journal of Wildlife Management, 61, 270–279.
Thomas, L. (1997). Retrospective power analysis. Conservation Biology, 11, 276–280.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25–32.
Yuan, K.-H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 30, 141–167.
Zumbo, B. D., & Hubley, A. M. (1998). A note on misconceptions concerning prospective and retrospective power. The Statistician, 47, 385–388.
Shuyan Sun
Educational Studies Program
School of Education
Dyer Hall 475E
University of Cincinnati
P.O. Box 210049
Cincinnati, OH 45221
USA
Tel. +1 513 608-3799
Fax +1 513 556-3535
E-mail
[email protected]