Journal of Experimental Criminology (2006) 2: 1–22 DOI: 10.1007/s11292-005-5129-7

© Springer 2006

Size matters: Standard errors in the application of null hypothesis significance testing in criminology and criminal justice

SHAWN D. BUSHWAY* and GARY SWEETEN
Department of Criminology and Criminal Justice, University of Maryland, 2220 LeFrak Hall, College Park, MD 20742, USA
*Corresponding author. E-mail: [email protected]

DAVID B. WILSON
Administration of Justice Program, George Mason University, Manassas, VA, USA

Abstract. Null Hypothesis Significance Testing (NHST) has been a mainstay of the social sciences for empirically examining hypothesized relationships, and the main approach for establishing the importance of empirical results. NHST is the foundation of classical or frequentist statistics. The approach is designed to test the probability of generating the observed data if no relationship exists between the dependent and independent variables of interest, recognizing that the results will vary from sample to sample. This paper is intended to evaluate the state of the criminological and criminal justice literature with respect to the correct application of NHST. We apply a modified version of the instrument used in two reviews of the economics literature by McCloskey and Ziliak to code 82 articles in criminology and criminal justice. We have selected three sources of papers: Criminology, Justice Quarterly, and a recent review of experiments in criminal justice by Farrington and Welsh. We find that most researchers provide the basic information necessary to understand effect sizes and analytical significance in tables that include descriptive statistics and some standardized measure of size (e.g., betas, odds ratios). On the other hand, few of the articles mention statistical power, and even fewer discuss the standards by which a finding would be considered large or small. Moreover, fewer than half of the articles distinguish between analytical significance and statistical significance, and most articles used the term 'significance' in ambiguous ways.

Key words: criminal justice, criminology, Justice Quarterly, regression, review, significance, standard error, testing

Introduction

Null Hypothesis Significance Testing (NHST) has been a mainstay of the social sciences for empirically examining hypothesized relationships, and the main approach for establishing the importance of empirical results. NHST is the foundation of classical or frequentist statistics founded by Fisher (the clearest statement is in Fisher 1935), and it has three key steps. First, a null hypothesis of "no difference" or "no relationship" is established, with no specific alternative hypothesis specified. In the second step, a test statistic is calculated under a number of distributional assumptions. Finally, if the probability of obtaining the calculated test statistic is below a certain threshold (typically 0.05 in the social sciences), the null hypothesis is rejected. The approach is designed to test the probability of generating the observed data if no relationship exists between the dependent and independent variables of interest, recognizing that the results will vary from sample to sample. A rejection of the null implies that the observed data were not generated by simple sampling variation, and therefore that a true difference or relationship between the key variables exists, with only some small chance of Type I error (i.e., concluding that there is a relationship when in fact there is none).

NHST has attracted criticism from its inception (e.g., Berkson 1938; Boring 1919), and critics of the approach can be found in psychology (Gigerenzer 1987; Harlow et al. 1997; Rozeboom 1960), medicine (Marks 1997), economics (Arrow 1959; McCloskey and Ziliak 1996), ecology (Anderson et al. 2000; Johnson 1999) and criminology (Maltz 1994; Weisburd et al. 2003). The critics can be prone to flowery hyperbole. Our favorite is from Rozeboom (1997), who stated that "(n)ull hypothesis testing is surely the most bone-headedly misguided procedure ever institutionalized in the rote training of science students" (p. 335).

The critics can be placed into three basic categories. The first group of critics believes that the statistical properties of this approach are flawed. For example, the approach ignores theoretical effect sizes and Type II error. From this perspective, alternative approaches such as Bayesian statistics, Neyman–Pearson decision theory, non-parametric tests, and Tukey exploratory techniques should at least supplement if not replace NHST (Maltz 1994; Zellner 2004). A second group of critics takes a more philosophical approach, focusing on the limitations of experimental and correlational designs in social science (Lunt 2004). These critics would prefer more qualitative evidence and theoretical development and less "rank empiricism." In contrast to the first two groups, the third group of critics is not interested in replacing NHST, but rather hopes to correct mistakes in the application of NHST.1 This last group focuses on the difference between analytical significance and statistical significance. To these critics, the size of the effect should drive the evaluation of the analysis, not its statistical significance. As such, effects can be analytically uninteresting (i.e., trivially small given the topic at hand) and statistically significant. Effects can also be analytically interesting and statistically non-significant. That is, tests can have low power, such that the researcher cannot reject the null hypothesis for analytically interesting effect sizes (Cook et al. 1979; Lipsey 1990; Weisburd et al. 2003). The goal of these reformers is to persuade researchers to focus on analytical or substantive significance in addition to statistical significance.

In economics, this effort has been led by McCloskey and Ziliak (1996; Ziliak and McCloskey 2004), who have conducted high-profile reviews of the lead journal in the field (American Economic Review) to document proper and improper usage of NHST. The movement is more advanced in medicine and psychology (Fidler 2002; Thompson 2004), where revised editorial standards strongly recommend that researchers move away from over-reliance on P-values and become more focused on effect sizes and confidence intervals (APA 2001). A recent issue of the Journal of Socio-Economics devoted to this topic is the first attempt to bring together researchers from multiple disciplines (sociology, psychology, economics, and ecology) to discuss the problem and possible solutions.2 In this issue, there is marked frustration at the slow adoption of good practice despite the clear consensus about the standards of good practice.

One possible reason for the lack of real change is that misuse causes no real harm. Elliott and Granger (2004) and Wooldridge (2004), for example, do not deny that people do a poor job of reporting their results, but nonetheless argue that NHST is useful for theory testing and that the intelligent reader usually can decipher the effect sizes for herself. On the other hand, Weisburd et al. (2003) provide a compelling case that poor use of NHST can in fact have profound negative consequences. Specifically, they are concerned about researchers who accept a null hypothesis in program evaluation and act as if the effect were actually zero, concluding that the program does not work. They sampled all program evaluations used in the Maryland Crime Prevention Report in which null findings were reported (Sherman et al. 1997). They then investigated whether there was enough statistical power to reject the null hypothesis of a small but substantively meaningful effect size of 0.2 or greater. In slightly less than half of the cases this reasonable effect size could not be rejected, suggesting that the null finding was not substantively the same as an effect size of zero. Or to put it another way, tests with low power to identify reasonable effect sizes were being used to conclude that programs did not work (see also Lipsey et al. 1985, for a similar conclusion). It is hard to argue that such practice is not both bad statistics and harmful for the accumulation of knowledge. It is not surprising that the first formal review of the use of NHST in criminology and criminal justice was done in the context of program evaluation.

Thompson (2004) argues that the shift to "meta-analytical" thinking (Cumming and Finch 2001), with its focus on effect sizes and replicability, is facilitating a real change in psychology and education research. Replicability is judged by evaluating the stability of effects across a related literature, a comparison that requires the use of standardized effect sizes and does not depend on statistical significance. Meta-analytic thinking has become increasingly common in criminology and criminal justice (Petrosino 2005; Wilson 2001). We believe, however, that the poor practice of NHST goes beyond program evaluation and has negative consequences for knowledge building in all areas of the field. While the change can start in the program evaluation literature, it must ultimately pervade criminology more broadly.

This paper is intended to evaluate the state of the criminological and criminal justice literature with respect to the correct application of NHST. Our assessment conceptualizes the application of NHST around four basic issues. The first two concern the size of the coefficients of interest. This broader issue has two components: reporting the size of an effect (or presenting the information necessary to determine the size of an effect), and interpreting the size of an effect, not merely its statistical significance. The third issue concerns the correct interpretation of non-significant effects, such as reporting confidence intervals or power analysis. The fourth issue focuses on basic errors in the application of NHST, such as errors in the specification of the null hypothesis or relying on statistical significance for the selection of variables in a multivariate model.

To address these four issues, we adapt the instrument used in economics by McCloskey and Ziliak (1996) and Ziliak and McCloskey (2004). We used this instrument to code 82 articles in criminology and criminal justice selected from three sources: Criminology, the flagship journal of the American Society of Criminology; Justice Quarterly, the flagship journal of the Academy of Criminal Justice Sciences; and a review piece by Farrington and Welsh (2005) on experiments in criminal justice. In each case our goal is to focus on outlets representing the best practice in a particular area of the field. We find very similar results across the outlets. In general, we find both good and bad practices in the field. On the one hand, most researchers provide the basic information necessary to understand effect sizes and analytical significance in tables that include descriptive statistics and some standardized measure of size (e.g., betas, odds ratios). In fact, most researchers describe the size of their coefficients in some way. On the other hand, only 31% of the articles mention power, and fewer than 10% of the articles discuss the standards by which a finding would be considered large or small. None of the articles explicitly test statistical power with a specific alternative hypothesis. It is not surprising, therefore, that only 40% of the articles distinguish between analytical significance and statistical significance, and only about 30% of the articles avoid using the term 'significance' in ambiguous ways. In large part, research in this field equates statistical significance with substantive significance. Researchers need to take the next step and start to compare effect sizes across studies rather than simply conclude that they have similar effects solely on the basis of a statistically significant finding in the same direction as previous work.

The paper proceeds in the next section by discussing the sampling frame, followed by a discussion of the instrument, and finally, the results.
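To make the three steps concrete, the following is a minimal sketch in Python using simulated data (all values hypothetical). It shows the test statistic and p-value of steps two and three alongside the standardized effect size (Cohen's d) that the reformers discussed above would have researchers report as well:

```python
# A minimal illustration of the three NHST steps on simulated (hypothetical) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treated = rng.normal(loc=0.3, scale=1.0, size=150)  # hypothetical treatment group
control = rng.normal(loc=0.0, scale=1.0, size=150)  # hypothetical control group

# Step 1: null hypothesis of "no difference" in population means, with no
# specific alternative. Step 2: test statistic under distributional assumptions.
t_stat, p_value = stats.ttest_ind(treated, control)

# Step 3: reject the null if p falls below the conventional 0.05 threshold.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject: {p_value < 0.05}")

# Size, not just significance: the standardized mean difference (Cohen's d).
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
print(f"d = {(treated.mean() - control.mean()) / pooled_sd:.2f}")
```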

Materials and methods

Sampling frame

We focus on the prominent articles in three discrete subfields: criminology, criminal justice, and experimental criminology. As such, we selected articles in the journals Criminology and Justice Quarterly in the years 2001 and 2002. We recognize that this strategy presents more of a snapshot than an overview; however, we would be surprised to find that practice in this two-year window was substantially different from practice in the surrounding years. We selected all articles that used bivariate analysis, ordinary least squares (OLS) methods (e.g., OLS regression, analysis of variance), or logistic regression. Although NHST is also appropriate for the broader class of non-linear models, we followed McCloskey and Ziliak (1996) by focusing on the simplest models in common usage.

There were 66 published articles in 2001–2002 in Criminology, and of those, 32 met our eligibility criteria. There were 62 articles in 2001–2002 in Justice Quarterly, and of those, 32 met our eligibility criteria. The most commonly excluded analyses were hierarchical, structural equation, tobit and count models. Because we also wanted to include program evaluations in our analysis, we used the recent Farrington and Welsh (2005) review of experimental studies as our sampling frame. We would have preferred to pick a journal like the Journal of Experimental Criminology rather than a review article. However, this journal is only in its second year, and there was no other unified source of experiments in criminology. The Farrington and Welsh review, published in this journal, represented what we believe is the definitive list of experiments in criminology. We focused on articles that had been published in journals since 1995. Of these 27 articles, we were able to acquire 18 through the University of Maryland library. Because these articles met the criteria for inclusion in Farrington and Welsh's (2005) review, we assume that they represent the state of experimental criminology, and are not significantly different from the other nine articles.

Coding protocol

The original instrument by McCloskey and Ziliak (1996) had 19 questions. We eliminated seven questions for one of three reasons: (1) we felt that McCloskey and Ziliak were taking an extreme position, such as when they requested simulations to determine if regression coefficients were reasonable; (2) we felt the question was redundant; or (3) we judged the question to be ambiguous, making it difficult to code consistently across studies. We also added three questions: one to explicitly test the issue raised by Weisburd et al. (2003) with respect to accepting the null hypothesis, and two others to address the use of confidence intervals to aid in the interpretation of effect sizes, particularly for null findings. All questions were coded as one for good practice and zero for bad practice. The questions and coding conventions, grouped by the four main issues, were:

A. Reporting effect size, or the information necessary to determine effect size

1) Were the units and descriptive statistics for all variables used in bivariate and multivariate analyses reported? Adequate reporting of descriptive statistics is necessary to fully assess the magnitude of effects. McCloskey and Ziliak insisted only on the display of means, but differences between means can only be properly interpreted in the context of sample variability. As such, to be considered good practice, we required that means and standard deviations be reported for continuous variables. Furthermore, we insisted that studies include descriptive statistics for all variables included in analyses. A number of studies provided only partial information and therefore were not given credit for good practice on this item. In this and other items, one could argue that this 'mistake' is quite minor. Nonetheless, this type of omission makes understanding the substantive significance of findings difficult, and complicates the comparison of effect sizes across studies. We did not determine if the sample sizes in the descriptive statistics table matched the sample sizes in the analysis, although technically this should be the case.

2) Were coefficients reported in elasticity (% change Y/% change X) form or in some interpretable form relevant for the problem at hand, so that readers can discern the substantive impact of the regressors? While we did find a few studies that used elasticities, this practice is uncommon in criminology and criminal justice. It is much more common to see standardized betas, odds ratios, or other effect size indices reported. In order to get credit for good practice, the study author had to provide betas/odds ratios/elasticities in all of the tables in which multivariate models were reported. This question was not applicable for papers that did not use multivariate regressions.

3) Did the paper eschew "asterisk econometrics," defined as ranking the coefficients according to the absolute size of the t-statistics? Reporting coefficients without a t-statistic, P-value, or standard error counts as 'asterisk econometrics.'

4) Did the paper present confidence intervals to aid in the interpretation of the size of coefficients? Any presentation, either in tables or the text, was coded as good practice. Just as a standard deviation provides a context for interpreting a mean by providing information about the variability in a distribution, confidence intervals provide a context for interpreting coefficients by providing information on precision, or the plausible range within which the population parameter is likely to fall. The use of confidence intervals as an adjunct to NHST has been widely recommended as a method to address weaknesses in NHST (e.g., APA Task Force on Statistical Inference 1996).

B. Interpreting effect size, not just statistical significance

5) Did the paper discuss the size of the coefficients? Any mention of the size of a coefficient in substantive terms in the text of the paper was coded as good practice. This coding decision was more lenient than that applied by McCloskey and Ziliak, but nonetheless we still found articles which failed this question. Simply listing the betas in replication of the table was not sufficient.

6) Did the paper discuss the scientific conversation within which a coefficient would be judged 'large' or 'small'? In other words, did the authors explicitly consider what other authors had found in terms of effect size, or what standards other authors had used to determine importance? Time after time, authors claimed that their results were similar to prior results in the literature solely on the basis of a statistically significant finding in the same direction as previous findings. This question required that some attempt was made to compare effect size across studies in an attempt to build knowledge. One could argue that this exercise is somewhat futile in a field like criminology, where many of the variables are unique scales or otherwise constructed variables without inherent meaning, and where treatments are applied to unique populations. But a comparison of betas or odds ratios is informative and might encourage a more standardized approach to measurement. Moreover, we find it hard to understand how criminological understanding can be advanced simply by knowing the sign and significance of an effect with no substantive understanding. This is particularly problematic in theory testing, where every theory has found support in the form of a significant coefficient in a reduced form model on the variable or variables thought to best represent the theory. Some index of the size of the observed effect is critical to understanding the importance and theoretical implications of a finding.

7) After the first use, did the paper avoid using statistical significance as the only criterion of importance? The other most common reference was to measures of fit such as R2.

8) In the conclusion section, did the authors avoid making statistical significance the primary means for evaluating the importance of key variables in the model? In other words, did the authors keep statistical significance separate from substantive meaning? This was coded as 'not applicable' if the paper did not center on key independent variables (e.g., exploratory analyses).

9) Did the paper avoid using the word "significance" in ambiguous ways, meaning "statistically significant" in one sentence and "large enough to matter for policy or science" in another? We conducted an electronic text search for the word fragment 'signific' and evaluated each usage to determine if the usage was clear. This item was coded as not applicable if no use of the word fragment 'signific' was found in the article. We did not automatically code a study as using 'significance' ambiguously if 'significant' was used without qualification (statistical v. substantive); rather, use of the term had to be consistent (if statistically significant was the default use, then a qualifier must be used if the authors meant 'substantively significant'). One ambiguous usage was enough to get a 'bad practice' score on this question.

C. Interpreting statistically non-significant effects

10) Did the paper mention the power of a test? We coded two types of power discussions. The first type involved some discussion of power (usually sample size) and its impact on parameter estimates. The second type of power discussion was an actual test of power in the study. We had some concerns that this question puts undue emphasis on post-hoc power analysis. While appropriate in some cases, post-hoc power analysis can lead to mistakes in interpretation because it requires the assumption of a 'known' or hypothetical effect size. A more general approach advocated in psychology and statistics involves the presentation of confidence intervals around the coefficients, so readers can observe the range of values consistent with the analysis (APA 2001; Hoenig and Heisey 2001). This practice has yet to see widespread use in criminology and criminal justice, but we believe that confidence intervals could easily be presented in most papers.

11) Did the paper make use of confidence intervals to aid in the interpretation of null findings? A statistically non-significant effect does not, in and of itself, provide a basis for accepting the null. Recall that NHST assumes the truthfulness of the null and then determines the probability of the observed data given that assumption. A statistically non-significant effect merely means that the data could reasonably occur by chance in a reality where the null hypothesis is true. This establishes the plausibility of the null but little more. It is fundamentally a weak conclusion. By providing a range of plausible values for the population effect, a confidence interval greatly facilitates the interpretation of a null finding. A confidence interval that includes the null and does not include a substantively meaningful value provides evidence that the effect is functionally null. A large confidence interval that includes substantively meaningful values despite being statistically non-significant, however, leaves open the possibility that the null is not only false but that a genuine effect of substantive size exists. See Rosenthal and Rubin (1994) for an interesting discussion of this issue.

12) Did the paper eschew 'sign econometrics', meaning remarking on the sign but not the size or significance of the coefficients? The most common form of "sign econometrics" is a discussion of the sign of a non-significant coefficient without a larger justification. A larger justification would include a statement that the effect sizes were large but not statistically significant because sample sizes were small. This question is explicitly focused on researchers who place a premium on statistical significance but report sign as if it is independent of statistical significance. The only case where researchers are justified in reporting the sign of a non-significant coefficient is when it is substantively meaningful and there are sample limitations (Greene 2003, Ch. 8). In general this question applies only to non-significant coefficients, but we also coded cases where researchers reported a comparison of two (significant) coefficients without an explicit hypothesis test. In this case, saying that coefficient A is bigger than coefficient B without considering statistical significance is focusing on the sign of the difference.

13) In the conclusions, did the authors avoid interpreting a statistically non-significant effect with no power analysis or confidence interval as evidence of no relationship? We developed this question based on Weisburd et al. (2003). Concluding there is no effect was justified if a confidence interval does not contain a meaningful effect. This was only coded for papers that were dealing with explicit treatments (the Farrington sample), which was directly comparable to the Weisburd et al. sample.

D. Avoiding basic errors in the application of NHST

14) Were the proper null hypotheses specified? The most common null hypothesis is that the coefficient is zero. In fact, this is usually a default position for researchers who estimate a simple reduced form regression without very much structure. Yet this may not be the relevant null hypothesis. In economics, theory often makes an explicit prediction about the size of a coefficient. We can think of only one case in criminology where this might be true: in sentencing research, researchers have begun including the presumptive sentence from sentencing guidelines (Engen and Gainey 2000). If judges follow the guidelines with random variation, the coefficient should be 1. However, criminologists do sometimes want to compare coefficients across groups, which implies a non-zero null.

15) Did the paper avoid choosing variables for inclusion solely on the basis of statistical significance? The standard logic, fairly common in the social sciences, is that if variables are statistically significant, they should be included even without theoretical justification. But this approach essentially equates statistical significance with substantive significance. Variables should be included because theory suggests that this is the process that generates the data. We coded as bad practice papers for which the only justification for the inclusion of variables was the finding of significance in prior studies. We admit to some ambivalence on this point because it seems like something of a semantic point, but it is nonetheless true that this approach is logically flawed. A much more problematic practice is the use of stepwise regression. Simulations have shown that stepwise regression can lead to statistically significant and strong results for variables that are unrelated to the dependent variable (Freedman 1983). We coded as bad practice any article which excluded control variables because of a lack of statistical significance.

Each article was coded by two coders.3 Two coders were responsible for each set of articles, so while the identity of coders switched between outlets, they remained constant within an outlet. The concordance rate was 76% in Criminology, 73% in Justice Quarterly, and 72% in the Farrington articles. After coding independently, the coders met to reconcile their decisions. The reconciled coding is reported in this paper.
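The check behind item 13 can itself be made concrete with a power calculation of the kind Weisburd et al. (2003) performed: before reading a null finding as "no effect," ask whether the design could plausibly have detected a small but meaningful effect (d = 0.2). A sketch using the statsmodels power routines, with hypothetical sample sizes:

```python
# Power of a two-group comparison to detect a small but meaningful effect.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# With 60 cases per arm (hypothetical), power to detect d = 0.2 is far below 0.80,
# so a null result says little about whether a meaningful effect exists.
power = analysis.power(effect_size=0.2, nobs1=60, ratio=1.0, alpha=0.05)
print(f"Power to detect d = 0.2 with n = 60 per group: {power:.2f}")

# Sample size per group that would give 80% power for d = 0.2 (roughly 394).
n_needed = analysis.solve_power(effect_size=0.2, power=0.80, alpha=0.05, ratio=1.0)
print(f"n per group needed for 80% power: {n_needed:.0f}")
```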

Results

The results of the survey suggest that researchers in the field of criminology and criminal justice are not applying NHST blindly, with a majority of studies providing information about the size of effects. We found many examples of good practice, and we also found that some authors were quite explicit in their understanding of the limitations of NHST. However, we also found many examples of bad practice. For example, most researchers failed to discuss the size of coefficients in substantive terms or to clearly distinguish statistically significant effects from substantively significant ones.

We provide the distribution of scores across the items in Table 1. The scores are very similar across outlets. The average paper received a score of 7. Several papers received a high score of 10. The lowest was a paper with a score of 2.

Table 1. Descriptive statistics for the number of correct scores by article source.

Source                   Mean  Median  Minimum  Maximum  SD   Percent  N
Justice Quarterly        7.0   7       2        10       1.8  54.5%    32
Criminology              6.8   7       2        10       2.0  52.9%    32
Farrington experiments   7.1   7       2        10       1.8  52.0%    18
Total                    6.9   7       2        10       1.8  53.3%    82

Percent reflects the mean percent correct for applicable items.

Table 2 provides the percentage of studies correctly addressing each item.

Table 2. Survey results, full sample (N = 82).

Presented statistical information necessary for a determination of the size of an effect
1. Were the units and descriptive statistics reported for all variables used in bivariate and multivariate analyses? – 64.6%
2. Were coefficients reported in elasticity form or some other interpretable form relevant for the problem at hand so that readers could discern the substantive impact of the regressors? – 76.1%
3. Did the paper eschew 'asterisk econometrics', defined as ranking the coefficients according to the absolute size of the t-statistics? – 57.3%
4. Did the paper present confidence intervals to aid in the interpretation of the size of coefficients? – 3.7%

Interpretation informed by the size of a coefficient, not merely its statistical significance
5. Did the paper discuss the size of the coefficients? – 75.0%
6. Did the paper discuss the scientific conversation within which a coefficient would be judged 'large' or 'small'? – 9.8%
7. After the first use, did the paper avoid using statistical significance as the only criterion of importance? – 75.6%
8. In the conclusion section, did the authors avoid making statistical significance the primary means for evaluating the importance of key variables in the model? – 44.2%
9. Did the paper avoid using the word "significance" in ambiguous ways, meaning "statistically significant" in one sentence and "large enough to matter for policy or science" in another? – 31.7%

Correctly handled non-significant effects
10. Did the paper mention the power of a test? – 30.5%
11. Did the paper make use of confidence intervals to aid in the interpretation of null findings? – 0.0%
12. Did the paper eschew 'sign econometrics', meaning remarking on the sign but not the size or significance of the coefficients? – 58.0%
13. In the conclusions, did the authors avoid interpreting a statistically non-significant effect with no power analysis as evidence of no relationship? – 55.6%

Avoided basic errors in the application of statistical significance testing
14. Were the proper null hypotheses specified? – 95.1%
15. Did the paper avoid choosing variables for inclusion solely on the basis of statistical significance? – 78.0%

The (relatively) good news is that most researchers in criminology presented statistical information necessary for an assessment of size. More specifically, roughly three-quarters of the authors reported some standardized version of their coefficient in order to facilitate interpretation,4 and over two-thirds provided descriptive statistics for all of the variables used in their analysis. The sole exception was confidence intervals, which all but three studies in our sample failed to report. It would clearly be better if 100% of authors provided these basic facts; indeed, over a third of these studies might need to be excluded from a meta-analysis because of the lack of basic descriptive statistics. Nonetheless, our results did show that the majority of articles provide the basic building blocks for a comparison of effect size. These building blocks provide a starting point for the conversation about size as well as statistical significance. It is also encouraging to note that these results are very similar to the results from the most recent Ziliak and McCloskey (2004) review in economics, which suggests that the problems in criminology are shared with at least one other social science.

One area where authors in criminology struggled was in interpreting their results with respect to the size of effects. Despite a fairly common discussion of size in the results section, many authors ultimately fell back on statistical significance as the ultimate arbiter of importance. Perhaps then it is not surprising that two-thirds of the authors used some form of the word 'significance' in an ambiguous way. It was common to find statements that a variable 'achieved significance' or became 'more/less significant' in the presence of additional variables. Variables were described as being 'modestly significant' or 'highly significant.' In general, the most common error was implying that the strength of the relationship was determined by the size of the p-value. This is simply not true, and can be quite misleading if large samples lead researchers to stress small effects with very large t-values. In keeping with this error, nearly half of the authors practiced 'asterisk econometrics,' rank ordering the coefficients according to the size of the test statistics. There is simply no scientific justification for equating statistical significance with substantive significance, either in relative or absolute terms. Statistical significance should be the starting point in any discussion of effect size, not the end point.

Very few of the authors (9.8%) presented a discussion of the standard by which their effects would be considered large or small. Virtually none of the authors compared the magnitude of their results with other studies, although information is provided in most cases that would allow this comparison. Gottfredson and colleagues (2003, evaluation sample) provide a notable exception. They compared their estimate of the effect of drug treatment courts on recidivism to the average effect reported in a prior meta-analysis. This simple comparison, usually absent in other studies, allows the reader to assess the importance of the findings beyond mere statistical significance. Repeatedly, authors claimed findings similar to prior studies based solely on coefficients which were significant in the same direction. Although this information is relevant, sign and significance represent a very coarse description which could be made far more precise with a discussion of the effect sizes in the two studies. Alternatively, theories could be developed more formally to provide explicit predictions about the size of the coefficients.

Non-significant effects are problematic in the social sciences. Almost half of the papers discuss the direction of non-significant effects ('sign econometrics') without attending to the issue of size or adequately addressing the non-significant nature of the effect. We found many cases where the authors adhered to strict NHST but then focused on the sign of the coefficient. For example, in one Criminology article the authors stated that the original bivariate relationship, although not significant, was positive. Then, they report that when additional variables were included, the sign of the relationship changed, "suggesting that the (initial) relationship was spurious." But according to NHST, the relationship was spurious even without the additional variables. In neither case do we know for certain that the relationship is substantively spurious.
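The problem is easy to see in a stylized example: a coefficient can fail to reach significance while its confidence interval still spans effects large enough to matter. A brief Python sketch with hypothetical numbers:

```python
# A non-significant coefficient whose CI still contains meaningful values.
from scipy import stats

b, se, n = 0.15, 0.10, 120   # hypothetical estimate, standard error, sample size
t = b / se                   # t = 1.5: not significant at the 0.05 level
p = 2 * (1 - stats.t.cdf(abs(t), df=n - 2))
ci_low, ci_high = b - 1.98 * se, b + 1.98 * se   # ~95% CI (t critical, df = 118)

print(f"t = {t:.2f}, p = {p:.3f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
# The CI runs from about -0.05 to 0.35: the data are consistent with a null
# effect, but also with effects large enough to matter, so "no relationship"
# is not a safe conclusion without more power.
```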
Without a high level of statistical power, substantively meaningful effects remain plausible even though the null hypothesis cannot be rejected. Equivocation is difficult to avoid under strict NHST. Augmenting NHST with confidence intervals can facilitate the interpretation of null findings (Hoenig and Heisey 2001). In the articles examined, there was virtually no consideration of Type II error (falsely accepting the null hypothesis). Less than a third of the authors mentioned the issue of power, and only two out of 82 articles included a power test.5 Moreover, almost half of the evaluations from the Farrington and Welsh review equated a failure to reject the null hypothesis with an effect size of zero, without conducting a power test or examining the confidence interval. None of the articles in the Justice Quarterly or Criminology samples conducted a power test, and only one in Justice Quarterly and two in the experimental sample provided a confidence interval. Authors simply do not provide an analytical discussion of their ability to find differences that may be meaningful.

In general, the studies included in this sample did well at avoiding basic errors in NHST. Roughly three-fourths of the studies avoided relying on statistical significance for the selection of variables in multivariate models, and most studies correctly specified the null hypothesis. A particularly good example of the latter was Steffensmeier and Demuth (2001). These authors used a population of all people convicted of a felony or misdemeanor in Pennsylvania, with over 68,000 cases. First, they demonstrated that they understood that populations should not be treated as samples by stating that "because our data set is not a sample, but contains all reported sentences with complete data, statistical tests of significance do not apply in the conventional sense" (p. 160). They then argued, in a fairly conventional way, for inclusion of the tests, but then immediately showed that the tests were not being applied without thinking. Specifically, they stated that "because the number of cases included in our analysis is so large, many small sentencing differences among groups or categories often turn out to be significant in the statistical sense. Therefore, we place more emphasis on direction and magnitude of the coefficients than on statistical significance levels . . ." (p. 160). This is exactly the correct approach to take in their case. In contrast, researchers conducting studies with small samples would be better served by focusing on confidence intervals and interpreting the range of possible values for a population parameter (e.g., from zero to a moderately large effect).

Other examples of good practice involve the specification of a null hypothesis. Over 90% of the articles received a positive rating on this question, even in cases when the null was non-zero. For example, Koons-Witt (2002) wanted to explore the impact of gender on sentencing decisions before and after the onset of sentencing guidelines. Therefore, she correctly specified the null hypothesis as equality between the coefficients on gender in separate models.6 Moreover, researchers often specify models in which a mediating variable is included, and the coefficient on the original variable is expected to decline. For example, Kleck and Chiricos (2002) hypothesized that the coefficient on unemployment in a regression of crime on unemployment would change when explicit measures of motivation and opportunity were included. In this case, they correctly specified the null hypothesis as the coefficient in the original equation. Throughout the discussion, Kleck and Chiricos explicitly focused on the change in the coefficient. However, we did find examples in which researchers incorrectly concluded that the effect had been mediated because the coefficient was no longer significantly different from zero.

Table 3 provides the results by source. For the most part, our results did not vary by source. The Criminology papers were more likely to provide descriptive statistics and avoid asterisk econometrics than Justice Quarterly papers, but Justice Quarterly papers were more likely to talk about size and avoid ambiguous usage of significance. In no way could it be said that authors in Criminology scored systematically better than the authors from the two other sources. The main difference was that the articles in the Farrington and Welsh sample were almost seven times more likely to discuss some standard of size. This makes sense because the experimental studies often reported program effect sizes in a research arena with prior results. Nonetheless, the vast majority of experiments also did not discuss the standard by which magnitude could be evaluated. In general, the pattern across the article sources was far more similar than different. Clearly, the problems are fairly endemic across the field and are not solely the responsibility of any one group of substantive researchers.

Table 3. Results by item and source of study (Criminology 2001–2002, N = 32; Justice Quarterly 2001–2002, N = 32; Farrington experiments, N = 18).

Presented statistical information necessary for a determination of the size of an effect
1. Units and descriptive statistics reported for all variables? – Criminology 72.7%; Justice Quarterly 59.4%; Farrington 55.6%
2. Coefficients reported in elasticity or some other interpretable form? – Criminology 78.1%; Justice Quarterly 86.2%; Farrington 45.5% (a)
3. Eschewed 'asterisk econometrics'? – Criminology 69.7%; Justice Quarterly 56.3%; Farrington 33.3%
4. Presented confidence intervals for the size of coefficients? – Criminology 0.0%; Justice Quarterly 3.1%; Farrington 11.1%

Interpretation informed by the size of a coefficient, not merely its statistical significance
5. Discussed the size of the coefficients? – Criminology 65.6%; Justice Quarterly 87.5%; Farrington 70.6%
6. Discussed the scientific conversation within which a coefficient would be judged 'large' or 'small'? – Criminology 6.1%; Justice Quarterly 3.1%; Farrington 27.8%
7. Avoided statistical significance as the only criterion of importance after first use? – Criminology 66.7%; Justice Quarterly 84.4%; Farrington 72.2%
8. Avoided statistical significance as the primary means of evaluating key variables in the conclusion? – Criminology 35.7%; Justice Quarterly 38.7%; Farrington 66.7%
9. Avoided using the word "significance" in ambiguous ways? – Criminology 24.2%; Justice Quarterly 37.5%; Farrington 33.3%

Correctly handled non-significant effects
10. Mentioned the power of a test? – Criminology 30.3%; Justice Quarterly 28.1%; Farrington 33.3%
11. Used confidence intervals to interpret null findings? – Criminology 0.0%; Justice Quarterly 0.0%; Farrington 0.0%
12. Eschewed 'sign econometrics'? – Criminology 57.6%; Justice Quarterly 61.3%; Farrington 50.0%
13. Avoided interpreting a non-significant effect with no power analysis as evidence of no relationship? – Criminology n/a; Justice Quarterly n/a; Farrington 55.6%

Avoided basic errors in the application of statistical significance testing
14. Were the proper null hypotheses specified? – Criminology 90.9%; Justice Quarterly 96.9%; Farrington 94.4%
15. Avoided choosing variables for inclusion solely on the basis of statistical significance? – Criminology 75.8%; Justice Quarterly 71.9%; Farrington 88.9%

(a) Only 11 of the 18 experiments were coded on this question because the results were simple mean comparisons. Of the remaining seven, regressions were occasionally run to support the main finding, and several articles did not report these regressions in tabular form.
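The cross-model null used by Koons-Witt can be written as a simple z-test on the difference between two independent coefficients (Brame et al. 1998). A sketch in Python, with hypothetical coefficients and standard errors:

```python
# Testing the null that two coefficients from independent samples are equal.
import math
from scipy import stats

b_pre, se_pre = 0.45, 0.12    # hypothetical gender coefficient before guidelines
b_post, se_post = 0.20, 0.10  # hypothetical gender coefficient after guidelines

# z = (b1 - b2) / sqrt(se1^2 + se2^2), valid for independent estimates.
z = (b_pre - b_post) / math.sqrt(se_pre**2 + se_post**2)
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.3f}")

# Note the non-zero null: the hypothesis is b_pre = b_post, not b = 0. As note 6
# cautions, extending this test to logit or probit coefficients requires strong
# assumptions about equal residual variation across the two groups.
```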

Discussion

NHST is the dominant approach to drawing inferences regarding research hypotheses in quantitative criminology and criminal justice. Although alternatives like Bayesian inference exist, the purpose of this article is not to convince the field to abandon NHST. Rather, we would like to encourage more thoughtful application. In essence, we want to put NHST in its place: as a tool to facilitate the inferential process, not as the end game for quantitative research.

The fundamental limitation of NHST is that it does not provide information about size. As the title of this article states, size matters. To state that there is an effect begs the question: how big? This requires attention to size and a scholarly discussion that addresses the substantive significance of findings. Therefore, it is not surprising that we believe our single most troubling finding was the lack of a serious attempt in most articles to place the magnitude of the effect in a context, or even to attend to the issue of size. A research study should be placed in the context of past work or theoretical predictions. Simply reporting the coefficient without any attempt to validate or otherwise establish the magnitude of the effect within the literature or policy framework risks creating a large body of independent research with no cumulative advance in knowledge. This is particularly evident when studies of a common research hypothesis with different sample sizes arrive at different conclusions based solely on NHST. Without attending to size, researchers may conclude that the empirical research base has led to an equivocal conclusion regarding the hypothesis. However, focusing on size may tell a more consistent story, or at least a story that is not determined by the sample size of the studies but rather by the size of the empirical relationships examined. It is the latter, after all, in which we are truly interested.

On the positive side, many of the key ingredients for a substantive discussion of size, like descriptive statistics and standardized coefficients, were reported in most of the studies in our review. All of the basic tools are there for researchers to take the next step and compare effect sizes across studies rather than simply concluding that they have similar findings solely on the basis of a significant finding in the same direction as previous work.

The role of sample size in determining statistical significance is also underappreciated. This is evident in our study in the near-total absence of any discussion of statistical power. In a research world in which sample sizes range from a few dozen to 68,000, this is particularly alarming, and strikes us as fundamentally unwise. Not all research designs have equal ability to identify the same effect size. A discussion of the relative power of a test to identify an effect, and an awareness of the confidence intervals around an effect, seem both reasonable and essential for a good evaluation of the value of a study.

One potential criticism of this type of discussion about NHST comes from the well-known econometrician Edward Leamer (2004). In his criticism of McCloskey and Ziliak, he states that it is ultimately not size, but models, that matter. Too much emphasis on the minutiae of NHST threatens to take attention away from the important question of whether the model provides insight into the question of interest. During our coding, we were often frustrated by the lack of discussion about the source of causal identification in the regression models, and the general lack of understanding about the limitations of observational studies with controls for observables to identify causality. While the criminology and criminal justice articles were not substantially different from the economics articles with respect to the use of NHST, we feel confident in stating that the application of causal models based on observational data is in fact substantially more thoughtful in economics. We were also frustrated by the lack of attention to measurement issues in many of the papers we reviewed. Often authors would raise key issues with respect to measurement only to proceed without addressing them. One anonymous reviewer made a compelling argument that this problem in criminology is far more pressing than any discussion about NHST. We do not necessarily disagree, but we nonetheless think that issues surrounding the appropriate use of NHST deserve attention by criminologists.

The goal of NHST is admirable: to protect against the acceptance of a research hypothesis when the observed data can be explained by sampling variation. This simple goal, however, has taken on a hegemonic role in the practice of social scientific research. Critical thinking about the meaningfulness of findings in scientific and practical terms is often lacking. Size matters. Large effects have different theoretical and practical importance than small effects. A binary accept/reject approach to hypothesis testing advances our field far less than approaches that explicitly assess whether observed effects are of a size consistent with theoretical expectations or are large enough to matter in a practical or policy context. This requires reasoned argumentation and scientific discourse, rather than reliance on an arbitrary and binary decision rule (i.e., p ≤ 0.05). The former requires greater skill and scholarly effort but also promises greater advancement for our field.

Acknowledgement

The authors wish to thank Emily Owens for help with the coding and the University of Maryland's Program for Economics of Crime and Justice Policy for generous financial support. We also wish to thank Michael Maltz and two anonymous reviewers for helpful advice. All errors remain our own.

Notes

1. It should be noted that there is no controversy about what constitutes correct use; the problem is unambiguously in the application.
2. This issue has received less attention in criminology and criminal justice than in these other disciplines. Maltz's 1994 paper in the Journal of Research in Crime and Delinquency is the best-known paper on the problems with NHST in criminology.
3. Three coders were used in the analysis. All three coders have successfully completed at least four upper-level courses in econometrics in an economics department and are at least 4th-year PhD students. All three coders read the McCloskey and Ziliak (1996) piece and participated in pilot coding and reconciling as a group.
4. Almost every author who reported logit coefficients also reported the odds ratio. We suspect this is because statistics packages now provide this information easily. However, there was some confusion about the interpretation of the odds ratio. For example, we found one paper in Justice Quarterly in which the authors interpreted the coefficients as the odds ratio, even though the odds ratio was also provided in the table. We recommend that authors simply report the change in probability associated with each x. This is far more intuitive, and several statistics packages, including Stata, now provide this information with a simple command (dlogit).
5. Even the two papers which conducted explicit power tests did not provide satisfying discussions of power. Both papers were ambiguous about the effect size for which they estimated power. Furthermore, one of the papers suggested that the low power associated with their statistical tests minimized the probability of Type II error. In fact, by definition low power is associated with a higher probability of Type II error.
6. Koons-Witt cited Brame et al. (1998) to justify the test statistic used to test the null hypothesis. Citation of this paper was very common in our sample. However, Brame et al. stated that their test is appropriate only for OLS and count models. This test can be used with logit or probit models if the researcher assumes "that both the functional form and the dispersion of the residual term for the latent response variable are identical for the two groups being compared" (p. 259, fn11). These are strong assumptions which are unlikely to be met in most cases. These assumptions were never justified, or even stated, in the papers that compared coefficients from logit or probit models.
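For readers who want to follow the recommendation in note 4, a minimal sketch using simulated (hypothetical) data and the statsmodels marginal-effects routine, which reports the average change in predicted probability per unit of x:

```python
# Reporting probability changes from a logit, rather than raw coefficients alone.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(-0.5 + 0.8 * x)))   # true model on the logit scale
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(np.exp(model.params[1]))            # odds ratio for x
print(model.get_margeff().summary())      # average change in Pr(y = 1) per unit x
```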


References (Asterisk indicates papers in sample.)

*Agnew, R. (2002). Experienced, vicarious, and anticipated strain: An exploratory study on physical victimization and delinquency. Justice Quarterly 19, 603–632.
*Agnew, R., Brezina, T., Wright, J. P. & Cullen, F. T. (2002). Strain, personality traits, and delinquency: Extending general strain theory. Criminology 40, 43–72.
*Alpert, G. P. & MacDonald, J. M. (2001). Police use of force: An analysis of organizational characteristics. Justice Quarterly 18, 393–409.
Anderson, D. R., Burnham, K. P. & Thompson, W. L. (2000). Null hypothesis testing: Problems, prevalence, and an alternative. Journal of Wildlife Management 64, 912–923.
APA (2001). Publication manual of the American Psychological Association (5th edition). Washington, DC: American Psychological Association.
APA Task Force on Statistical Inference (1996, December). Task Force on Statistical Inference initial report. Washington, DC: American Psychological Association. Available: http://www.apa.org/science/tfsi.html.
*Armstrong, T. A. (2003). The effect of moral reconation therapy on the recidivism of youthful offenders: A randomized experiment. Criminal Justice and Behavior 30, 668–687.
Arrow, K. J. (1959). Decision theory and the choice of a level of significance for the t-test. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow & H. B. Mann (Eds.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 70–78). Stanford, CA: Stanford University Press.
*Baller, R. D., Anselin, L., Messner, S. F., Deane, G. & Hawkins, D. F. (2001). Structural covariates of U.S. county homicide rates: Incorporating spatial effects. Criminology 39, 561–590.
*Baumer, E. P. (2002). Neighborhood disadvantage and police notification by victims of violence. Criminology 40, 579–616.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association 33, 526–536.
*Bernburg, J. G. & Thorlindsson, T. (2001). Routine activities in social context: A closer look at the role of opportunity in delinquent behavior. Justice Quarterly 18, 543–568.
*Borduin, C. M., Mann, B. J., Cone, L. T., Henggeler, S. W., Fucci, B. R., Blaske, D. M. & Williams, R. A. (1995). Multisystemic treatment of serious juvenile offenders: Long-term prevention of criminality and violence. Journal of Consulting and Clinical Psychology 63, 569–578.
Boring, E. G. (1919). Mathematical vs. scientific importance. Psychological Bulletin 16, 335–338.
*Braga, A. A., Weisburd, D. L., Waring, E. J., Mazerolle, L. G., Spelman, W. & Gajewski, F. (1999). Problem-oriented policing in violent crime places: A randomized controlled experiment. Criminology 37, 541–580.
Brame, R., Paternoster, R., Mazerolle, P. & Piquero, A. (1998). Testing for the equality of maximum-likelihood regression coefficients between two independent equations. Journal of Quantitative Criminology 14, 245–261.
*Broidy, L. M. (2001). A test of general strain theory. Criminology 39, 9–36.
*Burruss, G. M. Jr. & Kempf-Leonard, K. (2002). The questionable advantage of defense counsel in juvenile court. Justice Quarterly 19, 37–68.
*Campbell, F. A., Ramey, C. T., Pungello, E., Sparling, J. & Miller-Johnson, S. (2002). Early childhood education: Young adult outcomes from the Abecedarian project. Applied Developmental Science 6, 42–57.
*Cernkovich, S. A. & Giordano, P. C. (2001). Stability and change in antisocial behavior: The transition from adolescence to early adulthood. Criminology 39, 371–410.
*Chermak, S., McGarrell, E. F. & Weiss, A. (2001). Citizens' perceptions of aggressive traffic enforcement strategies. Justice Quarterly 18, 365–392.
Cook, T. D., Gruder, C. L., Hennigan, K. M. & Flay, B. R. (1979). The history of the sleeper effect: Some logical pitfalls in accepting the null hypothesis. Psychological Bulletin 86, 662–679.
*Copes, H., Kerley, K. R., Mason, K. A. & Van Wyk, J. (2001). Reporting behavior of fraud victims and Black's theory of law: An empirical assessment. Justice Quarterly 18, 343–364.
Cumming, G. & Finch, S. (2001). A primer on the understanding, use and calculation of confidence intervals that are based on central and non-central distributions. Educational and Psychological Measurement 61, 532–575.
*Curry, G. D., Decker, S. H. & Egley, A. Jr. (2002). Gang involvement and delinquency in a middle school population. Justice Quarterly 19, 275–292.
*Dawson, M. & Dinovitzer, R. (2001). Victim cooperation and the prosecution of domestic violence in a specialized court. Justice Quarterly 18, 593–622.
*DeJong, C., Mastrofski, S. D. & Parks, R. B. (2001). Patrol officers and problem solving: An application of expectancy theory. Justice Quarterly 18, 31–62.
*Dugan, J. R. & Everett, R. S. (1998). An experimental test of chemical dependency therapy for jail inmates. International Journal of Offender Therapy and Comparative Criminology 42, 360–368.
*Dunford, F. W. (2000). The San Diego Navy Experiment: An assessment of interventions for men who assault their wives. Journal of Consulting and Clinical Psychology 68, 468–476.
Elliott, G. & Granger, C. W. J. (2004). Evaluating significance: Comments on "Size matters". The Journal of Socio-Economics 33, 547–550.
*Engel, R. S. & Silver, E. (2001). Policing mentally disordered suspects: A reexamination of the criminalization hypothesis. Criminology 39, 225–252.
Engen, R. L. & Gainey, R. R. (2000). Modeling the effects of legally relevant and extralegal factors under sentencing guidelines: The rules have changed. Criminology 38, 1207–1230.
*Exum, M. L. (2002). The application and robustness of the rational choice perspective in the study of intoxicated and angry intentions to aggress. Criminology 40, 933–966.
Farrington, D. P. & Welsh, B. C. (2005). Randomized experiments in criminology: What have we learned in the last two decades? Journal of Experimental Criminology 1, 9–38.
*Feder, L. & Dugan, L. (2002). A test of the efficacy of court-mandated counseling for domestic offenders: The Broward experiment. Justice Quarterly 19, 343–376.
*Felson, R. B. & Ackerman, J. (2001). Arrest for domestic and other assaults. Criminology 39, 655–676.
*Felson, R. B. & Haynie, D. L. (2002). Pubertal development, social factors, and delinquency among adolescent boys. Criminology 40, 967–988.
*Felson, R. B., Messner, S. F., Hoskin, A. W. & Deane, G. (2002). Reasons for reporting and not reporting domestic violence to the police. Criminology 40, 617–648.
Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement 62, 749–770.

*Finn, M. A. & Muirhead-Steves, S. (2002). The effectiveness of electronic monitoring with violent male parolees. Justice Quarterly 19, 293–312.
Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver and Boyd.
Freedman, D. A. (1983). A note on screening regression equations. The American Statistician 37, 152–155.
*Garner, J. H., Maxwell, C. D. & Heraux, C. G. (2002). Characteristics associated with the prevalence and severity of force used by the police. Justice Quarterly 19, 705–746.
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Kruger, G. Gigerenzer & M. S. Morgan (Eds.), The probabilistic revolution. Vol. II: Ideas in the sciences (pp. 11–33). Cambridge, MA: MIT Press.
*Golub, A., Johnson, B. D., Taylor, A. & Liberty, H. J. (2002). The validity of arrestees' self-reports: Variations across questions and persons. Justice Quarterly 19, 477–502.
*Gottfredson, D. C., Najaka, S. S. & Kearley, B. (2003). Effectiveness of drug treatment courts: Evidence from a randomized trial. Criminology and Public Policy 2, 171–196.
*Greenberg, D. F. & West, V. (2001). State prison populations and their growth, 1971–1991. Criminology 39, 615–654.
Greene, W. H. (2003). Econometric analysis (5th edition). Upper Saddle River, NJ: Prentice-Hall.
Harlow, L. L., Mulaik, S. A. & Steiger, J. H. (Eds.) (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.
*Harmon, T. R. (2001). Predictors of miscarriages of justice in capital cases. Justice Quarterly 18, 949–968.
*Hay, C. (2001). Parenting, self-control, and delinquency: A test of self-control theory. Criminology 39, 707–736.
*Henggeler, S. W., Melton, G. B., Brondino, M. J., Scherer, D. G. & Hanley, J. H. (1997). Multisystemic therapy with violent and chronic juvenile offenders and their families: The role of treatment fidelity in successful dissemination. Journal of Consulting and Clinical Psychology 65, 821–833.
*Hennigan, K. M., Maxson, C. L., Sloane, D. & Ranney, M. (2002). Community views on crime and policing: Survey mode effects on bias in community surveys. Justice Quarterly 19, 565–587.
Hoenig, J. M. & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician 55, 19–24.
*Inciardi, J. A., Martin, S. S., Butzin, C. A., Hopper, R. M. & Harrison, L. D. (1997). An effective model of prison-based treatment for drug-involved offenders. Journal of Drug Issues 27, 261–278.
*Ireland, T. O., Smith, C. A. & Thornberry, T. P. (2002). Developmental issues in the impact of child maltreatment on later delinquency and drug use. Criminology 40, 359–400.
Johnson, D. H. (1999). The insignificance of statistical significance testing. Journal of Wildlife Management 63, 763–772.
*Kaminski, R. J. & Marvell, T. B. (2002). A comparison of changes in police and general homicides: 1930–1998. Criminology 40, 171–190.
*Kautt, P. & Spohn, C. (2002). Cracking down on black drug offenders? Testing for interactions among offenders' race, drug type, and sentencing strategy in federal drug sentences. Justice Quarterly 19, 1–36.
*Kempf-Leonard, K., Tracy, P. E. & Howell, J. C. (2001). Serious, violent, and chronic juvenile offenders: The relationship of delinquency career types to adult criminality. Justice Quarterly 18, 449–478.

*Killias, M., Aebi, M. & Ribeaud, D. (2000). Does community service rehabilitate better than short-term imprisonment? Results of a controlled experiment. Howard Journal 39, 40–57.
*Kingsnorth, R. F., MacIntosh, R. C. & Sutherland, S. (2002). Criminal charge or probation violation? Prosecutorial discretion and implications for research in criminal court processing. Criminology 40, 553–578.
*Kleck, G. & Chiricos, T. (2002). Unemployment and property crime: A target-specific assessment of opportunity and motivation as mediating factors. Criminology 40, 649–679.
*Koons-Witt, B. A. (2002). The effect of gender on the decision to incarcerate before and after the introduction of sentencing guidelines. Criminology 40, 297–328.
*Kramer, J. H. & Ulmer, J. T. (2002). Downward departures for serious violent offenders: Local court "corrections" to Pennsylvania sentencing guidelines. Criminology 40, 897–932.
Leamer, E. E. (2004). Are the roads red? Comments on "size matters". The Journal of Socio-Economics 33, 555–557.
Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research. Newbury Park, CA: Sage.
Lipsey, M. W., Crosse, S., Dunkle, J., Pollard, J. & Stobart, G. (1985). Evaluation: The state of the art and the sorry state of the science. In D. S. Cordray (Ed.), Utilizing prior research in evaluation planning (New Directions for Program Evaluation, No. 27, pp. 7–28). San Francisco: Jossey-Bass.
Lunt, P. (2004). The significance of the significance test controversy: Comments on "size matters". The Journal of Socio-Economics 33, 559–564.
*Maguire, E. R. & Katz, C. M. (2002). Community policing, loose coupling, and sensemaking in American police agencies. Justice Quarterly 19, 503–536.
Maltz, M. D. (1994). Deviating from the mean: The declining significance of significance. Journal of Research in Crime and Delinquency 31, 434–463.
Marks, H. M. (1997). The progress of experiment: Science and therapeutic reform in the United States, 1900–1990. Cambridge, UK: Cambridge University Press.
*Marlowe, D. B., Festinger, D. S., Lee, P. A., Schepise, M. M., Hazzard, J. E. R., Merrill, J. C., Mulvaney, F. D. & McLellan, A. T. (2003). Are judicial status hearings a key component of drug court? During-treatment data from a randomized trial. Criminal Justice and Behavior 30, 141–162.
*Marquart, J. W., Barnhill, M. B. & Balshaw-Biddle, K. (2001). Fatal attraction: An analysis of employee boundary violations in a southern prison system, 1995–1998. Justice Quarterly 18, 877–910.
*Mastrofski, S. D., Reisig, M. D. & McCluskey, J. D. (2002). Police disrespect toward the public: An encounter-based analysis. Criminology 40, 519–552.
*McCarthy, B., Hagan, J. & Martin, M. J. (2002). In and out of harm's way: Violent victimization and the social capital of fictive street families. Criminology 40, 831–865.
McCloskey, D. N. & Ziliak, S. T. (1996). The standard error of regressions. Journal of Economic Literature 34, 97–114.
*McNulty, T. L. (2001). Assessing the race–violence relationship at the macro level: The assumption of racial invariance and the problem of restricted distributions. Criminology 39, 467–490.
*Meehan, A. J. & Ponder, M. C. (2002). Race and place: The ecology of racial profiling African American motorists. Justice Quarterly 19, 399–430.
*Menard, S., Mihalic, S. & Huizinga, D. (2001). Drugs and crime revisited. Justice Quarterly 18, 269–300.

*Mills, P. E., Cole, K. N., Jenkins, J. R. & Dale, P. S. (2002). Early exposure to direct instruction and subsequent juvenile delinquency: A prospective examination. Exceptional Children 69, 85–96.
*Ortmann, R. (2000). The effectiveness of social therapy in prison: A randomized experiment. Crime and Delinquency 46, 214–232.
*Peterson, D., Miller, J. & Esbensen, F.-A. (2001). The impact of sex composition on gangs and gang member delinquency. Criminology 39, 411–440.
Petrosino, A. (2005). From Martinson to meta-analysis: Research reviews and the US offender treatment debate. Evidence & Policy: A Journal of Research, Debate and Practice 1, 149–172.
*Piquero, A. R. & Brezina, T. (2001). Testing Moffitt's account of adolescence-limited delinquency. Criminology 39, 353–370.
*Pogarsky, G. (2002). Identifying "deterrable" offenders: Implications for research on deterrence. Justice Quarterly 19, 431–452.
*Rebellon, C. J. (2002). Reconsidering the broken homes/delinquency relationship and exploring its mediating mechanism(s). Criminology 40, 103–136.
*Rhodes, W. & Gross, M. (1997). Case management reduces drug use and criminality among drug-involved arrestees: An experimental study of an HIV prevention intervention. Washington, DC: National Institute of Justice.
*Richards, H. J., Casey, J. O. & Lucente, S. W. (2003). Psychopathy and treatment response in incarcerated female substance abusers. Criminal Justice and Behavior 30, 251–276.
Rosenthal, R. & Rubin, D. B. (1994). The counternull value of an effect size: A new statistic. Psychological Science 5, 329–334.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin 57, 416–428.
Rozeboom, W. W. (1997). Good science is abductive, not hypothetico-deductive. In L. L. Harlow, S. A. Mulaik & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 335–392). Mahwah, NJ: Lawrence Erlbaum Associates.
*Scheider, M. C. (2001). Deterrence and the base rate fallacy: An examination of perceived certainty. Justice Quarterly 18, 63–86.
*Schnebly, S. M. (2002). An examination of the impact of victim, offender, and situational attributes on the deterrent effect of defensive gun use: A research note. Justice Quarterly 19, 377–398.
*Schwartz, M. D., DeKeseredy, W. S., Tait, D. & Alvi, S. (2001). Male peer support and a feminist routine activities theory: Understanding sexual assault on the college campus. Justice Quarterly 18, 623–650.
Sherman, L. W., Gottfredson, D., MacKenzie, D., Eck, J., Reuter, P. & Bushway, S. (1997). Preventing crime: What works, what doesn't, what's promising: A report to the United States Congress. Washington, DC: National Institute of Justice.
*Silver, E. (2002). Mental disorder and violent victimization: The mediating role of involvement in conflicted relationships. Criminology 40, 191–212.
*Simons, R. L., Stewart, E., Gordon, L. C., Conger, R. D. & Elder, G. Jr. (2002). A test of life-course explanations for stability and change in antisocial behavior from adolescence to young adulthood. Criminology 40, 401–434.
*Spohn, C. & Holleran, D. (2001). Prosecuting sexual assault: A comparison of charging decisions in sexual assault cases involving strangers, acquaintances, and intimate partners. Justice Quarterly 18, 651–688.
*Spohn, C. & Holleran, D. (2002). The effect of imprisonment on recidivism rates of felony offenders: A focus on drug offenders. Criminology 40, 329–358.
*Steffensmeier, D. & Demuth, S. (2001). Ethnicity and judges' sentencing decisions: Hispanic–Black–White comparisons. Criminology 39, 145–178.

*Stewart, E. A., Simons, R. L. & Conger, R. D. (2002). Assessing neighborhood and social psychological influences on childhood violence in an African-American sample. Criminology 40, 801–829.
*Swanson, J. W., Borum, R., Swartz, M. S., Hiday, V. A., Wagner, H. R. & Burns, B. J. (2001). Can involuntary outpatient commitment reduce arrests among persons with severe mental illness? Criminal Justice and Behavior 28, 156–189.
*Taylor, B. G., Davis, R. C. & Maxwell, C. D. (2001). The effects of a group batterer treatment program: A randomized experiment in Brooklyn. Justice Quarterly 18, 171–201.
*Terrill, W. & Mastrofski, S. D. (2002). Situational and officer-based determinants of police coercion. Justice Quarterly 19, 215–248.
Thompson, B. (2004). The "significance" crisis in psychology and education. The Journal of Socio-Economics 33, 607–613.
*van Voorhis, P., Spruance, L. M., Ritchey, P. N., Listwan, S. J. & Seabrook, R. (2004). The Georgia cognitive skills experiment: A replication of Reasoning and Rehabilitation. Criminal Justice and Behavior 31, 282–305.
*Velez, M. B. (2001). The role of public social control in urban neighborhoods: A multilevel analysis of victimization risk. Criminology 39, 837–864.
*Vogel, B. L. & Meeker, J. W. (2001). Perceptions of crime seriousness in eight African-American communities: The influence of individual, environmental, and crime-based factors. Justice Quarterly 18, 301–321.
Weisburd, D., Lum, C. M. & Yang, S.-M. (2003). When can we conclude that treatments or programs "don't work"? The Annals of the American Academy of Political and Social Science 587, 31–48.
*Weitzer, R. & Tuch, S. A. (2002). Perceptions of racial profiling: Race, class, and personal experience. Criminology 40, 435–456.
Wellford, C. (1989). Towards an integrated theory of criminal behavior. In S. Messner, M. M. Krohn & A. Liska (Eds.), Theoretical integration in the study of deviance and crime: Problems and prospects (pp. 119–128). Albany, NY: State University of New York Press.
*Wells, L. E. & Weisheit, R. A. (2001). Gang problems in nonmetropolitan areas: A longitudinal assessment. Justice Quarterly 18, 791–824.
*Welsh, W. N. (2001). Effects of student and school factors on five measures of school disorder. Justice Quarterly 18, 911–948.
*Wexler, H. K., Melnick, G., Lowe, L. & Peters, J. (1999). Three-year reincarceration outcomes for Amity in-prison therapeutic community and aftercare in California. Prison Journal 79, 321–336.
Wilson, D. B. (2001). Meta-analytic methods for criminology. Annals of the American Academy of Political and Social Science 578, 71–89.
Wooldridge, J. M. (2004). Statistical significance is okay, too: Comment on "size matters". The Journal of Socio-Economics 33, 577–579.
*Wright, B. R. E., Caspi, A., Moffitt, T. E. & Silva, P. A. (2001). The effects of social ties on crime vary by criminal propensity: A life-course model of interdependence. Criminology 39, 321–352.
*Wright, J. P., Cullen, F. T., Agnew, R. S. & Brezina, T. (2001). "The root of all evil"? An exploratory study of money and delinquent involvement. Justice Quarterly 18, 239–268.
Zellner, A. (2004). To test or not to test and if so, how? Comments on "size matters". The Journal of Socio-Economics 33, 581–586.
Ziliak, S. T. & McCloskey, D. N. (2004). Size matters: The standard error of regressions in the American Economic Review. The Journal of Socio-Economics 33, 527–546.
