Journal of Consulting and Clinical Psychology, 1994, Vol. 62, No. 1, 75-82
METHODOLOGICAL DEVELOPMENTS
Misuse of Statistical Tests in Three Decades of Psychotherapy Research

Reuven Dar, Ronald C. Serlin, and Haim Omer

This article reviews the misuse of statistical tests in psychotherapy research studies published in the Journal of Consulting and Clinical Psychology in the years 1967-1968, 1977-1978, and 1987-1988. It focuses on 3 major problems in statistical practice: inappropriate uses of null hypothesis tests and p values, neglect of effect size, and inflation of Type I error rate. The impressive frequency of these problems is documented, and changes in statistical practices over the past 3 decades are interpreted in light of trends in psychotherapy research. The article concludes with practical suggestions for rational application of statistical tests.
Reuven Dar and Haim Omer, Department of Psychology, Tel Aviv University, Tel Aviv, Israel; Ronald C. Serlin, Department of Educational Psychology, University of Wisconsin-Madison. Correspondence concerning this article should be addressed to Reuven Dar, Department of Psychology, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel.

For half a century, articles criticizing the methodology of traditional null hypothesis testing (where the null hypothesis posits equality of means, zero correlation, etc.) have appeared regularly in journals of the social sciences, especially with regard to research in psychology. The most severe criticisms questioned the basic rationale of this methodology (e.g., Berkson, 1942; Carver, 1978; Rozeboom, 1960), some going as far as to blame it for the slow progress in psychology (Dar, 1987; Meehl, 1978). Most of these critics advocated abolishing null hypothesis tests altogether, but these recommendations have had little impact: Null hypothesis tests seem to be an institutionalized tradition (Gigerenzer, 1987), unlikely to forgo their central role in psychological research methodology. Other critics, rather than urging that null hypothesis tests be done away with, have suggested various adaptations designed to make hypothesis testing methodology more rational, such as making statistical tests tougher (Serlin & Lapsley, 1985), eliminating their decisional role (Folger, 1989), and paying more attention to considerations of power (Cohen, 1969). Strikingly, however, these proposals also seem to have had little impact on actual practice in psychology (Sedlmeier & Gigerenzer, 1989). It is possible that researchers are simply not interested in revising the methodology of null hypothesis tests; still, one must assume that they would not knowingly and maliciously continue to use flawed statistical procedures. Furthermore, there appears to be no reason for journal editors, who have a major influence on scientific practices, to participate in a conspiracy for the perpetuation of spurious findings. How is it, then, that misuse of null hypothesis tests and the associated methodology has been so refractory to criticism?

Two factors may have contributed to this puzzling phenomenon. First, it is possible that researchers and editors fail to connect abstract statistical arguments with what they are actually doing. One goal of this article, therefore, is to sample the actual statistical testing practices in psychological research, thereby making the discussion of common errors and the proposal of rational alternatives more tangible and relevant. Second, we believe that statistical testing practices are intimately linked to general attitudes toward theory and research. Recently, Omer and Dar (1992) documented a shift from theoretical to pragmatic interests during the past 3 decades of psychotherapy research. Thus, theoretical questions about the mechanisms underlying therapeutic change have been discarded in favor of practical questions concerning the effectiveness of specific treatments and diagnostic methods. On the one hand, this shift was manifested by a rise in the standards of clinical validity, but on the other hand, it was manifested by a decline in the role of theoretical rationales and predictions. In the absence of clear theoretical commitments and predictions, we suggest, the prevailing attitude to research tends to become one of "let us see what turns out." Such an attitude may undermine the basis of statistical tests, which cannot be used in the absence of a priori theoretical hypotheses. Therefore, the pragmatic shift may contribute directly to the puzzling tenacity of flawed null hypothesis testing methodology in psychological research.

Our choice of psychotherapy research as the area in which to examine and document the use of statistical tests does not imply that this field is uniquely guilty of misusing this methodology. Rather, our choice was determined by two considerations: First, psychotherapy is one of the central areas of psychological research and one that has both theoretical and pragmatic interests; second, this choice allowed us to relate our findings to the pragmatic shift in psychotherapy research reported by Omer and Dar (1992). We concentrated on three major problems in statistical practice: inappropriate uses of null hypothesis tests and p values, neglect of effect size, and inflation of Type I error rate. These problems have all been described by others (e.g., Carver, 1978; Rosnow & Rosenthal, 1989); our goal here is not to unearth new errors but to relate statistical criticism to actual practice as well as to general trends in psychological research.
Inappropriate Uses of Null Hypothesis Tests and p Values

The logic of null hypothesis tests is as follows: Assume the null hypothesis is true, and then examine the likelihood of obtaining the sample results based on this assumption; if this likelihood is lower than a specified criterion (the alpha level), reject the null hypothesis. Therefore, such tests cannot be used to confirm the null hypothesis: We can only estimate the probability of obtaining the data given the truth of the null hypothesis, not the probability of the truth of the null hypothesis given the data. One common example of using statistical tests to confirm the null hypothesis is concluding that groups were equivalent at baseline by showing that there were no statistically significant differences between them.

The sample p value, in the context of null hypothesis testing, is involved in a binary decision: determining whether the predetermined criterion of Type I error, the alpha level, has been surpassed. Researchers, however, often make much more of the p value, turning it into the oracle of truth. Thus, they may linger over the obtained p values, honoring small values with three asterisks and with expressions such as "highly significant" or with happy exclamation marks. On the other hand, in their wish to escape the tragic predicament of losing a publication because of painful near misses, they cite "marginally significant" and "borderline significant" effects. The reporting of both highly and marginally significant results is equally misleading; it reflects the false belief that p measures the validity or strength of the results (see Carver, 1978) rather than merely the probability of the results given the truth of the null hypothesis.

Several authors (e.g., Rosnow & Rosenthal, 1989; Skipper, Guenther, & Nass, 1967) suggested using p values descriptively rather than inferentially. In this view, exact p values can be reported as such, with no regard to a predetermined alpha level and without the inferential consequence of statistical significance, for "surely, God loves the .06 nearly as much as the .05" (Rosnow & Rosenthal, 1989, p. 1277). Clearly, however, researchers cannot both use the obtained p values descriptively and endow them with binary inferential significance. This attempt to have the cake and eat it too is exemplified by the practice of taking any obtained p value and rounding it upward, thus creating post hoc an impression of a predetermined alpha level (e.g., a p value of .0027 is reported as p < .005), a practice that results in numerous pseudo-alpha levels being reported in a single study.

We examined the aforementioned problems by counting the number of different pseudo-alpha levels reported in the articles we sampled, by checking whether exact p values were reported in the context of null hypothesis tests, and by recording whether and how borderline significant effects were claimed. We also documented the use of an analysis of variance (ANOVA) to prove equality of groups at baseline, which is, as noted earlier, an example of wrongly attempting to confirm the null hypothesis.
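The point that p values do not measure the strength of results can be made concrete with a small numerical illustration (our own sketch, using hypothetical data rather than any of the studies reviewed): two comparisons with an identical standardized mean difference can yield p values that differ by orders of magnitude simply because of sample size.

```python
# Illustrative sketch with hypothetical data: the same standardized effect
# yields very different p values depending only on sample size, so p cannot
# be read as a measure of the strength or importance of a result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.4  # population mean difference in standard deviation units

for n in (20, 200, 2000):
    group1 = rng.normal(loc=0.0, scale=1.0, size=n)
    group2 = rng.normal(loc=true_effect, scale=1.0, size=n)
    t, p = stats.ttest_ind(group1, group2)
    # Cohen's d: mean difference divided by the pooled standard deviation
    pooled_sd = np.sqrt((group1.var(ddof=1) + group2.var(ddof=1)) / 2)
    d = (group2.mean() - group1.mean()) / pooled_sd
    print(f"n per group = {n:5d}   d = {d:5.2f}   p = {p:.5f}")
```

In a typical run, the small-sample comparison does not even pass the .05 criterion, whereas the large-sample p value is minuscule, although the underlying effect is the same in all three cases.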
Neglect of Effect Size

It seems obvious that there is no contradiction between null hypothesis tests and consideration of effect size. Indeed, Serlin and Lapsley's (1985) good-enough principle integrates effect size into existing significance testing methodology. Nevertheless, the two practices have been presented as if there were a choice involved: "Significance test or effect size?" (Chow, 1988). We suggest that the practice of examining actual p values for their degree of significance may have led to a neglect of effect size in psychological research. If a p value of .0002 means a wildly significant effect, who needs to look at the percent of variance accounted for? This again illustrates the mistaken interpretation of p values as measures of validity, replicability, or importance of findings.

We believe that consideration of effect size is crucial not only for pragmatic research, in which one attempts to assess the value of population parameters or the actual contribution of a given treatment, but also for theoretical research. This is so because, as Meehl (1978) and others have noted, the null hypothesis is never literally true in the social sciences, where everything is (at least minimally) related to everything else. Traditional null hypothesis methodology, therefore, subjects theories to extremely weak tests, violating the standards set by modern philosophy of science (e.g., Lakatos, 1970). If this tendency is to be overcome, it is essential that effect size become an integral part of theory testing. A traditional way of including the effect size in the statistical decision is to examine its associated confidence interval, which indicates a range within which the true effect may be presumed to lie when sampling variability is considered.

It was previously reported that the pragmatic shift in psychotherapy research has led to a more frequent consideration of clinical significance (Omer & Dar, 1992). In this article, we examined the prevalence of other effect size considerations in the articles we surveyed. To prevent ambiguity, we defined effect size in a very broad sense, including effect estimates in standard deviation units, comparisons to norms, or proportion of explained variance, as well as the reporting of confidence intervals.
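As a sketch of how such broadly defined effect size information might be reported in practice (the group summaries below are hypothetical, not drawn from the surveyed articles), the following computes a standardized mean difference and an approximate 95% confidence interval around it, using the common large-sample variance approximation for Cohen's d.

```python
# Sketch: Cohen's d with an approximate 95% confidence interval, using the
# standard large-sample approximation
#   var(d) ~= (n1 + n2)/(n1 * n2) + d**2 / (2 * (n1 + n2)).
import math

def cohens_d_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    # Pooled standard deviation of the two groups
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean2 - mean1) / pooled_sd
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Hypothetical treatment study with 25 subjects per group.
d, (lo, hi) = cohens_d_ci(mean1=21.0, sd1=6.0, n1=25, mean2=16.5, sd2=5.5, n2=25)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval, rather than the p value alone, shows both how large the effect may plausibly be and how imprecisely it has been estimated.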
Inflation of Type I Error Rate

The need to control for the overall risk of Type I error, rather than treating each statistical test as an independent experiment, is stressed in most introductory texts on statistical analysis (e.g., Keppel, 1973). Controlling for inflation of Type I error rate is essential not only for statistical reasons; it also prevents reckless data mining and leads to responsible research planning. The correct apportioning of alpha to prevent inflation of Type I error rate requires that researchers conceptualize and define their families of hypotheses, an issue we discuss more fully later, and consider their predictions carefully, as each additional statistical test further reduces the per-test alpha level. Specific procedures to control for experimentwise or familywise rate of Type I error
have long been available (e.g., Dunn, 1961). It has been our impression, however, that they are often ignored by researchers. In this study, we documented the frequency with which alpha levels were adjusted for multiple tests in our sample of psychotherapy research, and we documented which procedures were used for that purpose. We also documented other practices that may inflate Type I error, such as following multivariate ANOVAs (MANOVAs) with univariate ANOVAs (Bird & Hadzi-Pavlovic, 1983), following multiple regression with univariate tests of significance for regression slopes, and following significant interactions with null hypothesis tests of simple effects without adjusting the alpha level for the number of simple effects tested.

Method

In this report, we used the same database and the same rating procedure as those used in the recent report on trends in psychotherapy research (Omer & Dar, 1992). We reviewed all regular research articles on psychotherapy (broadly defined as the treatment of any dysfunction by psychological means) that appeared in the Journal of Consulting and Clinical Psychology (JCCP) in the years 1967-1968 (the 60s; n = 27), 1977-1978 (the 70s; n = 91), and 1987-1988 (the 80s; n = 45), a total of 163 articles. (We refer to these time periods as the 60s, 70s, and 80s for the sake of convenience and consistency with Omer and Dar [1992].) In contrast to Omer and Dar (1992), case studies and brief reports were excluded, because statistical procedures are rarely used or reported in full in these types of studies.

The following variables were rated for each article: inappropriate uses of null hypothesis tests and p values, neglect of effect sizes, and inflation of Type I error. We counted the different alpha levels cited in each study and noted whether actual obtained p values were reported. We checked whether borderline effects were discussed and whether an ANOVA was used to demonstrate lack of baseline differences. We checked which, if any, measures of effect size were mentioned, including confidence intervals, various measures of percent of explained variance, and degree of clinical significance. We examined whether any compensation for multiple statistical tests was used, as well as the type of compensation procedure. We also documented other practices (discussed earlier) that may inflate Type I error, especially the ways in which omnibus and multivariate procedures were followed.
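Before turning to the results, the arithmetic behind the alpha-inflation problem described above is worth making concrete. The sketch below (our own illustration, not part of the survey itself) shows how quickly the familywise Type I error rate grows when k independent tests are each conducted at an alpha of .05, and what a simple Bonferroni adjustment does to the per-test criterion.

```python
# Sketch: familywise Type I error rate for k independent tests at a per-test
# alpha of .05, compared with a Bonferroni-adjusted per-test criterion.
alpha = 0.05
for k in (1, 3, 5, 10, 20):
    familywise = 1 - (1 - alpha) ** k  # chance of at least one false rejection
    bonferroni = alpha / k             # per-test alpha that holds the family at .05
    print(f"k = {k:2d}   familywise error = {familywise:.2f}   per-test alpha = {bonferroni:.4f}")
```

With 20 independent tests, the chance of at least one spurious rejection is about .64; correlated tests inflate somewhat less, but the direction of the problem is the same.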
Results

Inappropriate Uses of Null Hypothesis Tests and p Values

We stressed earlier that, for the inferential game to be played fairly, an alpha level must be chosen a priori, and the statistical decision must be binary. We found that very few studies in psychotherapy used a predetermined, single alpha level. Instead, the studies we examined mentioned up to 20 pseudo-alpha levels, ranging from .15 to .00001 (see Figure 1). The mean number of different pseudo-alpha levels used in a single study rose from 2.59 (SD = 1.08) in the 60s to 4.00 (SD = 2.41) in the 70s to 4.56 (SD = 3.33) in the 80s, paralleling the pragmatic shift; indeed, we found only one study in the 80s that used a single alpha level. A planned contrast comparing the 60s with the 70s and 80s (combined) confirmed this trend, t(156) = 3.37, p < .005 (we chose the conservative alpha level of .005 to control the overall Type I error rate for the number of planned contrasts and correlations reported). The problems associated with the reporting of multiple
pseudo-alpha levels were exacerbated in a sizable proportion of studies in the 70s and 80s (12.1% and 15.9%, respectively) by the use of exact p values in addition to, or instead of, a comparison with a criterion (e.g., p = .00423, rather than p < .05). In violation of the binary nature of statistical tests, discussed earlier, 46.3% of all studies (40.7% in the 60s, 48.4% in the 70s, and 45.5% in the 80s) included and discussed effects that were nonsignificant but close to the implicit criterion of .05. Such effects were described in these studies as marginal, trends, tendencies, borderline significant, approaching significance, near significant, or almost significant. A p value of .06 was described as a strong trend, whereas a p value of .10 was labeled a stable tendency. A painful near miss of p = .055 received the elaborate description of a "trend toward an effect that nearly reached the conventional significant level." We found that p values of up to .15 were considered a trend, whereas a p value of .18 received only the modest description of "in the hypothesized direction." A few cases were even more extreme: A t test that resulted in a p value of .24 (!) was interpreted to indicate that one group showed more change although it did not reach statistical significance.

The wish to have the cake and eat it too, mentioned earlier, is best exemplified in the following quote from a study conducted in the 70s: "Children with outgoing volunteers changed more than children with quiet volunteers (p = .10), and this difference is significant if the comparison is considered a one-tailed test of a prior hypothesis." Needless to say, no such prior hypothesis was mentioned in the study.

We found that p values received an absurdly central position in many of the studies we examined, often at the expense of descriptive statistics. In a study of weight loss, for example, the authors chose to mention in the abstract the imposing p value of .001 rather than any measure of weight loss. In another study, a special table was included that contained only asterisks representing decreasing pseudo-alpha levels, from one humble * to a mind-boggling ****. Finally, one article we sampled contained only p values; no means, standard deviations, correlation coefficients, or any other statistics were mentioned.

The inappropriate use of an ANOVA to prove equality of groups at baseline, described earlier as a case of falsely attempting to confirm the null hypothesis, rose sharply from only one study in the 60s to 37.4% of studies in the 70s and 53.3% of studies in the 80s (the correlation between decade and proportion of studies using this procedure was .32, p < .005). Nonsignificant results of such ANOVAs were interpreted to mean that the groups were statistically equivalent.
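A small simulation (ours, with hypothetical parameters) shows why such a conclusion is unwarranted: with group sizes typical of psychotherapy studies, even a real baseline difference of half a standard deviation usually fails to reach significance, so a nonsignificant F reflects low power rather than equivalence.

```python
# Sketch: how often a *real* baseline difference goes undetected when groups
# are small, illustrating why "no significant difference" cannot be read as
# evidence of baseline equivalence. All parameters are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, true_diff, n_sims = 15, 0.5, 5000  # true difference of 0.5 SD
misses = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_diff, 1.0, n_per_group)
    _, p = stats.f_oneway(a, b)  # one-way ANOVA on the two groups
    if p >= 0.05:
        misses += 1
print(f"Proportion nonsignificant despite a real difference: {misses / n_sims:.2f}")
```

With 15 subjects per group, roughly three quarters of the simulated comparisons come out nonsignificant despite the built-in difference.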
[Figure 1. Bar graph indicating the distribution of the number of different pseudo-alpha levels per article (horizontal axis: number of different pseudo-alpha levels, from 1 to more than 10).]

Neglect of Effect Size

As we noted, a test of zero effect provides very little information, and an estimate of effect size must become a part of the confirmation process. We found that, in general, effect size is increasingly being discussed in psychotherapy research. Whereas only 14.8% of studies in the 60s explicitly reported
effect sizes, this proportion grew to 29.7% in the 70s, and by the 80s, 61.4% of the studies included effect size measures (the correlation between decade and proportion of studies reporting effect size was .33, p < .005). This parallels the rise in the use of explicit, predetermined criteria for clinical significance in pragmatic research, documented by Omer and Dar (1992; see Figure 2). Effect size estimates in the studies we reviewed included squared multiple correlation coefficients and differences in proportions or means. However, we found that, in most cases, no effect size estimates were reported at all. Specifically, no measures of effect size (i.e., eta or omega squared or other measures of the percent of variance accounted for by the independent variables) were ever reported in the context of an ANOVA. Even more important, none of the 163 studies reported confidence intervals, which, as explained earlier, are necessary for taking sampling error into account when estimating population values.

[Figure 2. Bar graph indicating the frequency of reporting effect size and clinical significance by time period. 60s = the years 1967-1968; 70s = the years 1977-1978; 80s = the years 1987-1988.]
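Reporting such measures requires little beyond what an ANOVA already computes. The sketch below (hypothetical data, not from the surveyed studies) derives eta squared and omega squared directly from the between- and within-group sums of squares of a one-way design.

```python
# Sketch: eta squared and omega squared for a one-way ANOVA, computed from
# the sums of squares the analysis already yields. The three treatment
# groups below are hypothetical.
import numpy as np

groups = [np.array([12., 15., 11., 14., 16., 13.]),
          np.array([18., 17., 20., 16., 19., 21.]),
          np.array([14., 13., 16., 15., 12., 17.])]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k, n_total = len(groups), all_scores.size

ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ss_between + ss_within
ms_within = ss_within / (n_total - k)

eta_sq = ss_between / ss_total
omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
print(f"eta^2 = {eta_sq:.2f}   omega^2 = {omega_sq:.2f}")
```

Either index tells the reader what proportion of the outcome variance the grouping factor accounts for, which the F ratio and its p value alone do not convey.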
Inflation of Type I Error

As mentioned earlier, it is necessary not only to specify the alpha level in advance but also to use techniques that control the Type I error rate at the specified level. We found that only a minority of studies (26 of 111, or 23.4%) compensated even minimally for the performance of multiple statistical tests such as multiple t tests, contrasts, or correlations. On the positive side, we found that this proportion is on the rise, from 11.8% in the 60s to 16.4% in the 70s and to 38.5% in the 80s (the correlation between decade and proportion of studies using some compensation procedure was .24, p < .005). Nevertheless, the somber implication of these findings is that, in up to three fourths of the studies overall and in two thirds of the studies in the past decade, there has been a proliferation of falsely rejected null hypotheses.

A striking finding in the articles we reviewed is the dominance of ANOVA as the preferred data analysis technique in psychotherapy research. Omnibus analyses, including ANOVA, analysis of covariance (ANCOVA), multivariate analysis of variance (MANOVA), and multivariate analysis of covariance (MANCOVA), were included in 82.9% of the studies we reviewed. Among these, MANOVAs are becoming ever more popular (Figure 3): Whereas in the 60s only a single study used MANOVA and none used MANCOVA, by the 80s 29 studies (64.4%) included these multivariate techniques (the correlation between decade and proportion of studies using MANOVA or MANCOVA was .39, p < .005).

[Figure 3. Bar graph indicating the frequency of univariate and multivariate analyses of variance by time period. 60s = the years 1967-1968; 70s = the years 1977-1978; 80s = the years 1987-1988.]

The ubiquitous use of ANOVAs, especially the multivariate techniques, is related to several practices that may also inflate Type I error rate. A sizable number of studies (n = 59) reported post hoc contrasts following ANOVA. The majority of those studies, however (n = 46), used procedures that either are more appropriate for planned comparisons (e.g., the Duncan procedure, which was used in 16 cases) or do not conserve Type I error rate (e.g., the Newman-Keuls procedure, used in 11 cases). Similarly, we found that the practice of using univariate ANOVAs after MANOVAs is almost universal in these studies:
Among studies using MANOVAs (n = 28), all but one reported this inappropriate procedure. The similarly flawed practice of using univariate tests of significance for regression slopes after multiple regression was found in five of the seven studies using multiple regression. Finally, the practice of using null hypothesis tests of simple effects after a significant test of an interaction, without adjusting the alpha level for the number of simple effects tested, was found in 58.6% of the studies that detected significant interactions (n = 70).

Discussion

The results may be examined from two perspectives: what they tell about the misuse of statistical tests in general and what trends regarding this misuse can be discerned through the three time periods examined. In general, null hypothesis tests have been misinterpreted and misapplied all along; researchers give an impression of playing the binary decisional game by using terms such as tests and significance level, but in actuality, they disregard its rules by overemphasizing and misinterpreting p values, using multiple pseudo-alpha levels, flirting with borderline p values, and inflating Type I error rates in the great majority of studies.

A comparison among the 3 decades we sampled does not always show a coherent picture. For example, although there is an
increased effort to control for inflation of Type I error in the context of multiple tests, practices related to ANOVAs have contributed to a decreased protection of the alpha level over the years. Two trends, however, can be clearly discerned when the three time periods are compared. We documented a rise in considerations of effect size, including clinical significance, in psychotherapy studies; we also found a mushrooming of omnibus multivariate techniques and a growing misuse of null hypothesis tests. How can one account for these seemingly opposite trends?

Our view is that both trends are related to "the flight from theory into pragmatics" (Omer & Dar, 1992). We suggest that, as psychotherapy research has become more centered on questions of clinical validity and less interested in conceptual matters, two consequences have followed: On the one hand, the trend toward pragmatic, applied research has led to an increased demand for clinically meaningful findings; on the other hand, the decline of theoretical guidance and predictions has led to a decline in planned statistical tests and to an increased use of exploratory procedures, especially omnibus MANOVAs, which have been made easy to perform by new and fast computers and statistical packages. Clearly, however, this view is quite speculative; explanations that give a more central role to the development of computers, or simply to changes in the preferences of specific editors of JCCP, are certainly viable alternatives. An
unequivocal conclusion from this study, at any rate, is that much of the current use of statistical tests is flawed, and the following suggestions for a reform of statistical practice are intended to remedy some of these flaws and to suggest a more rational and defensible data analysis strategy.

First and foremost, we strongly believe in the importance of theory in guiding research: "When theory does not play a selective role, our data-gathering activities belong to the realm of journalism rather than science" (Kukla, 1989, p. 794). The task of the researcher becomes much clearer when studies are derived from theory and specific predictions are made. Clear questions necessarily lead to a studied choice of measures, subjects, and design. Furthermore, predictions make statistical tests more powerful by allowing for planned rather than post hoc statistical procedures and, typically, for one-tailed rather than two-tailed tests. The statistical choices must be a direct reflection of the research questions: If the questions involve simple effects, so should the analysis; if they involve interactions, so should the statistical tests.

In choosing a statistical strategy, we agree with Cohen that less is more (except for sample size) and simple is better (Cohen, 1990, pp. 1304-1305). As Rosnow and Rosenthal (1989) pointed out recently, contrast analysis is preferable to the ubiquitous omnibus tests. An ANOVA may result in nonrejection of the null hypothesis even in a situation where there is an evident (and predicted) pattern of differences among group means; moreover, a significant effect tells us only that
there exists at least one, possibly uninterpretable, significant contrast among group means, which is a piece of information typically of doubtful value. Omnibus multivariate techniques, including the MANOVAs and MANCOVAs that we found to be increasingly popular, compound the problems of the univariate ANOVA. Not only are the results of the omnibus test often uninterpretable, as noted earlier, but the discriminant function used to combine the dependent variables in these procedures capitalizes on sampling error to maximize the desired effect and is typically uninterpretable as well.

Next, the Type I error rate and the way it is to be apportioned must be determined a priori. Our preference is to control the error rate at the level of families of hypotheses. With that said, we hasten to add that deciding what constitutes a family is not necessarily an easy task. In the behavioral sciences, and perhaps particularly in psychotherapy research, everything usually correlates with everything else, so that distinct families cannot be expected to be orthogonal. What constitutes a family, therefore, has to be decided by an examination of each particular design. For example, assume one conducts a pragmatic study to examine the hypothesis that imaginal exposure is superior to relaxation as a treatment for fear of flying. Assume that the subjects are men with a diagnosis of "flying phobia" and that they are randomly assigned to one of the two treatments. Two constructs, anxiety and coping skills, are the dependent variables of interest; anxiety is measured by self-report (e.g., a state-trait
questionnaire), physiological measures (e.g., heart rate), and behavioral measures (e.g., taking a transatlantic commercial flight), and coping skills are measured by subjective reports of self-efficacy and resourcefulness. Assigning an alpha of .05 to the experiment as a whole would leave us with a corrected alpha of .01 for each of the five measures, a decision that would greatly reduce the power of our tests; however, assigning .05 to each measure would unduly inflate the overall Type I error rate. But what are the families of hypotheses here, and what alternatives are there for partitioning the Type I error rate? Our view is that there are two families here, paralleling the two questions that guided the design: (a) How do the treatments affect anxiety? (b) How do the treatments affect coping skills? Despite the likelihood that all five measures are intercorrelated, we would assign an alpha of .05 to each family.

The absence of a simple rule for deciding what our families of hypotheses are should not drive us, however, into a multivariate frenzy. Using MANOVA, for example, would be inappropriate in this case for at least three reasons. First, as mentioned earlier, the omnibus test will reject the hypothesis in question whenever an arbitrary linear combination of the dependent variables turns out to be significant. We are rarely interested, however, in arbitrary linear combinations of variables. Either we are interested in the sum of variables that best measures the construct in question, in which case the summary measure should be created intelligently by the researcher (possibly with the aid of a principal-components analysis) and a univariate test performed on it, or we are interested in the individual variables, in which case each should be examined by a univariate analysis. Second, as explained earlier, performing the omnibus test commits the researcher to follow up with contrasts whose significance is determined by the Scheffe criterion, which is typically enormous. And third, performing the omnibus test commits the researcher to two-tailed contrasts, which further reduces the power of the test compared with the more powerful directional comparisons allowed in a planned analysis.

Similar considerations apply to regression analysis, applied to a research hypothesis about the relationship between several predictors and a dependent variable. Here, the usual practice of testing each regression coefficient at the .05 level inflates Type I error in exactly the same manner as when MANOVAs are followed by univariate tests. Typically, predictors in regression analysis can be construed as sets that measure conceptually independent constructs, each set forming a family that addresses a different hypothesis under test. Therefore, just as in the ANOVA example, the researcher should assign the chosen Type I error rate to each family of predictors, and univariate analyses can be performed within a family by dividing the Type I error rate by the number of variables in the family.

Except in the most extraordinary of circumstances, too awful to describe here, no researcher should ever base a conclusion as to theory confirmation or treatment effectiveness on stepwise regression. This procedure, made popular by statistical packages, capitalizes on chance in selecting and ordering the variables that are entered as predictors. Not only is this the worst case of the researcher taking no responsibility for his or her own results, but it has also been shown recently (Hurvich & Tsai, 1990)
that when tests are performed following screening procedures like the stepwise method, the Type I error rate is again unacceptably inflated. Therefore, for both theoretical and statistical reasons, the researcher should always select predictors and determine their order of entry into the regression analysis in advance.

Finally, as we argued earlier, effect size must be taken into consideration not only in determining clinical significance but also in the context of statistical tests. Serlin and Lapsley's (1985) good-enough principle offers a way of integrating effect size into statistical tests by testing the researcher's hypothesis not against zero but against a predetermined interval. This methodology, however, may deter researchers because of its complexity, and it is not yet readily applicable in many designs. A minimal alternative, as we mentioned, is to consider confidence intervals when judging obtained effects. In drawing boundaries around obtained effects, confidence intervals provide essential information for estimating effect sizes in the populations.

We believe that this general approach to statistical analysis would help make it as sound as the rest of the methodology used in psychotherapy research projects. As others have urged, however (Sedlmeier & Gigerenzer, 1989), journal editors, who were the force that helped institutionalize null hypothesis tests, must adopt these and other recommendations into their editorial policy for them to have any impact on the field.

References

Berkson, J. (1942). The test of significance considered as evidence. Journal of the American Statistical Association, 37, 325-333.
Bird, K. D., & Hadzi-Pavlovic, D. (1983). Simultaneous test procedures and the choice of a test statistic in MANOVA. Psychological Bulletin, 93, 167-178.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105-110.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. San Diego, CA: Academic Press.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.
Dar, R. (1987). Another look at Meehl, Lakatos, and the scientific practices of psychologists. American Psychologist, 42, 145-151.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52-64.
Folger, R. (1989). Significance tests and the duplicity of binary decisions. Psychological Bulletin, 106, 155-160.
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Kruger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution, Vol. 2: Ideas in the sciences (pp. 11-33). Cambridge, MA: MIT Press.
Hurvich, C. M., & Tsai, C.-L. (1990). The impact of model selection on inference in linear regression. The American Statistician, 44, 214-217.
Keppel, G. (1973). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall.
Kukla, A. (1989). Nonempirical issues in psychology. American Psychologist, 44, 785-794.
Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In I. Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91-196). Cambridge, England: Cambridge University Press.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
Omer, H., & Dar, R. (1992). Changing trends in three decades of psychotherapy research: The flight from theory into pragmatics. Journal of Consulting and Clinical Psychology, 60, 88-93.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.
Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance test. Psychological Bulletin, 57, 416-428.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.
Serlin, R. C., & Lapsley, D. K. (1985). Rationality in psychological research: The good-enough principle. American Psychologist, 40, 73-83.
Skipper, J. K., Guenther, A. L., & Nass, G. (1967). The sacredness of .05: A note concerning the uses of statistical tests of significance in social sciences. The American Sociologist, 2, 16-18.
Received December 9, 1991
Revision received April 28, 1993
Accepted May 5, 1993