
RESEARCH NOTE

Understanding and Interpreting Effect Size Measures

Craig Winston LeCroy and Judy Krysik

Statistical methods are the hallmark of quantitative research. Examining whether a result is statistically significant is standard content in social work research statistics courses. In all of social science, statistical significance testing has been the accepted procedure for examining results. Despite ongoing efforts aimed at encouraging researchers to report some index of magnitude that is not directly affected by sample size (for example, effect size), statistical significance testing appears to remain the standard. In 1994, the APA Publication Manual provided "encouragement" to authors to report effect sizes, with little impact (see, for example, Keselman et al., 1998; Kirk, 1996; Thompson & Snyder, 1998). The current Publication Manual of the American Psychological Association (2001) states that "it is almost always necessary to include some index of effect size or strength of relationship" (p. 25). This position was influenced by the APA Task Force on Statistical Methods, which recommended that researchers report alpha levels and effect sizes (Wilkinson & Task Force on Statistical Inference, 1999). This was also the stance advocated by Dr. Jacob Cohen (1990), the statistical power expert, who argued that "the primary product of a research inquiry is one or more measures of effect size, not p values" (p. 12). The field has slowly responded, and effect sizes are increasingly visible (Fidler et al., 2005; Vacha-Haase & Thompson, 2004). Indeed, some journal editors, especially in psychology, now require authors to report effect size measures (Harris, 2003; Snyder, 2000; Trusty, Thompson, & Petrocelli, 2004). The reporting of effect size measures is also increasing in social work journals; however, it is still common to find studies void of effect size indices (for example, Claiborne, 2006; Engelhardt, Toseland, Gao, & Banks, 2006; Padgett, Gulcur, & Tsemberis, 2006; Perry, 2006). There do not appear to be any social work journals that require effect size reporting.


Whereas many researchers are familiar with the use of effect size measures for power estimation and sample size determination, they are less familiar with using such measures to interpret their findings. Many researchers fail to understand that there are different types of effect size measures, with different methods of calculation within each type, and that the interpretation of effect size varies depending on the measure used. The purpose of this column is to further the basic understanding of effect size measures: what they mean, how they differ, and to suggest one way of presenting outcome data for easier interpretation. Because effect sizes are becoming part and parcel of the social science research enterprise, they must be clearly understood. In particular, this column is focused on the use of effect size measures when presenting the results of intervention research.

THE BASICS

Measures of effect size provide critically different information than alpha levels. This is because effect size addresses the practical importance of the results as assessed by the magnitude of the effect. It is well known that one can obtain a statistically significant result that is not practically significant. One basic misunderstanding in statistical analysis is thinking that an observed p value that is considered highly significant, for example p = .0001, also reflects a large effect. The p value simply represents the likelihood that a finding is due to chance or sampling error and reveals nothing about the size of the effect. One important reason to use effect size measures is that they can help the researcher consider the importance of his or her findings apart from statistical significance. This degree of analysis is too often neglected, as reflected in comments by Abelson (1995), who noted that statistical significance tests should be used for "guidance rather than sanctification" (p. 9).


Rosenthal and colleagues (2000) demonstrated how effect sizes can be used in conjunction with significance levels to arrive at an inference. Consider a research finding whereby the researcher reported a nonsignificant p value and unknowingly had a large effect size. In this instance the researcher would conclude that there was no effect for the intervention. If this were an assessment of the effectiveness of a new drug for reducing cancer cells, it would not take long to realize that the conclusion represents a serious mistake. Indeed, as frequently found in many studies, being underpowered, that is, not having a large enough sample to obtain statistical significance, is not uncommon. Therefore, it is likely that a researcher may falsely conclude a finding of "no difference" when, in fact, an effect size would show there was a meaningful difference. A conclusion of no difference based on statistical significance testing might lead the researcher in the cancer drug study to stop his or her investigation, but a meaningful effect size might suggest just the opposite: the importance of replicating the study with a larger sample that is sufficiently powered and looking across similar studies to assess the reliability of the effect size findings. Rosenthal and colleagues described this problem in inference as the "failure to perceive the practical importance of nonsignificant results" (p. 4). In contrast, a researcher might have a large sample and obtain a statistically significant finding but observe only a small effect. A common conclusion is that this is a very significant result when it may be of little practical importance (Abelson, 1995; Cohen, 1990). This is not to suggest that all small effects are of little practical importance. The practical relevance of an effect size must be judged within the context of the problem studied. We should not be fooled by the small effect that, when multiplied across a large population, can produce a large impact. What can be gained from using effect sizes in conjunction with p values is that researchers learn to resist the common practice of making automatic, anti-null decisions when a p value is less than the standard rejection level and pro-null decisions when p is greater than the rejection level (Nelson, Rosenthal, & Rosnow, 1986). Many researchers acknowledge that there is no justification for a sharp distinction between statistically "significant" and "nonsignificant" results (Abelson, 1995; Cohen, 1990; Rosenthal et al., 2000). Indeed, the standard rejection levels set by Sir Ronald Fisher were


chosen because they were convenient points on a continuous scale (Yates, 1990). For this reason, many statisticians recommend reporting exact p values in combination with effect size measures. The main advantages of effect sizes are that, unlike the statistical significance test, the value is independent of the sample size; they are expressed in standardized units that facilitate comparison across studies; and they represent the magnitude of the differences, which is what is meaningful to researchers and their audiences. Effect size interpretations should not be void of measurement considerations such as score reliability, especially when comparing effect sizes across independent studies. For further discussion of this issue see Baugh (2002).

DIFFERENT MEASURES OF EFFECT SIZE

Two common types of effect size measures are the standardized mean difference and the effect size correlation. In addition to different types of effect size measures, there are different methods of calculation within types. The most commonly used effect size is a measure of the standardized mean difference known as Cohen's d. This is the effect size measure commonly used in power analyses. Cohen's d, as illustrated in the following equation, is calculated by subtracting the mean of the control group (M_C) from the mean of the treatment group (M_T) at posttreatment and then dividing by the pooled standard deviation of the two groups. The pooled standard deviation is the square root of the average of the squared standard deviations for each group. If the variances of the two groups are equal, then the standard deviation of either group can be used (Cohen, 1988). If the groups are of unequal sizes, then the pooled standard deviation is calculated using the formula shown below.

d = (M_T - M_C) / s_pooled

s_pooled = sqrt[ ((n_T - 1)s_T² + (n_C - 1)s_C²) / (n_T + n_C - 2) ]

for groups of unequal sizes, where n_T and n_C are the group sizes and s_T and s_C the group standard deviations. The following example illustrates the calculation of Cohen's d for groups of equal sizes where the experimental group has a posttreatment mean of 26 on a 42-point scale, the control group a mean of 22, and the pooled standard deviation is 6. The standardized effect size, or Cohen's d, would be equal to



.66, as calculated below. This value represents the difference between the two means expressed in terms of their common standard deviation; thus a d of .66 indicates that two-thirds of a standard deviation separates the two means. If the difference in the two means is positive, it is in the direction of improvement, and if it is negative, it shows deterioration.

d = (26 - 22) / 6 = 4 / 6 = 0.66

Not all effect sizes are calculated in this way, use the same metric, or result in the same interpretation. Table 1 presents a summary of various effect size measures and their interpretation. The effect size correlation for intervention research can be computed directly as the point-biserial correlation between a dichotomous independent variable and a continuous dependent variable (obtained using the CORR procedure in SPSS). By squaring the correlation coefficient, the proportion of the variance in the dependent variable explained by the independent variable is obtained. Thus, r² is sometimes used as a way to characterize the strength of the effect size correlation. For example, squaring a point-biserial r of .3 results in an r² of .09, or 9% of the variance explained. Like Cohen's d, the effect size correlation is a scale-free measure.
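To make these calculations concrete, the following short Python sketch (an illustration added here, not part of the original article; the score vectors are hypothetical) computes Cohen's d with the pooled standard deviation and the point-biserial r for two small groups whose means (26 and 22) and pooled standard deviation (6) match the worked example above.

import math

def cohens_d(treatment, control):
    # Cohen's d using the pooled standard deviation (handles unequal group sizes)
    n_t, n_c = len(treatment), len(control)
    m_t, m_c = sum(treatment) / n_t, sum(control) / n_c
    var_t = sum((x - m_t) ** 2 for x in treatment) / (n_t - 1)
    var_c = sum((x - m_c) ** 2 for x in control) / (n_c - 1)
    s_pooled = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return (m_t - m_c) / s_pooled

def point_biserial_r(treatment, control):
    # Pearson correlation between group membership (1 = treatment, 0 = control) and scores
    scores = list(treatment) + list(control)
    groups = [1] * len(treatment) + [0] * len(control)
    n = len(scores)
    mean_s, mean_g = sum(scores) / n, sum(groups) / n
    cov = sum((s - mean_s) * (g - mean_g) for s, g in zip(scores, groups))
    ss_s = sum((s - mean_s) ** 2 for s in scores)
    ss_g = sum((g - mean_g) ** 2 for g in groups)
    return cov / math.sqrt(ss_s * ss_g)

# Hypothetical posttreatment scores: means of 26 and 22, pooled SD of 6
treatment = [33, 19, 31, 21, 30, 22]
control = [29, 15, 27, 17, 26, 18]

d = cohens_d(treatment, control)          # about 0.67, as in the worked example
r = point_biserial_r(treatment, control)  # about 0.34
print(f"d = {d:.2f}, r = {r:.2f}, r squared = {r * r:.2f}")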

The problem confronted most often with effect size measures is the interpretation of their value. Although there are no generally agreed-upon standards for interpreting the magnitude of effect sizes, researchers have ingrained Cohen's (1988) notion that a small effect is .2, a medium effect is .5, and a large effect is .8 (Dunst, Hamby, & Trivette, 2004). Often r and d are confused and Cohen's guidelines are applied to r, which is problematic because r and d have different metrics. For example, an r of .5 corresponds to a d of 1.15. Many researchers incorrectly assume that you cannot obtain a d of over 1.0. Cohen's d can be calculated from r by multiplying r by 2 and dividing by the square root of 1 minus r². Conversion tables for d and r are readily available (see, for example, Dunst et al.); a short conversion sketch follows this paragraph. An additional problem in interpreting the strength of the effect size correlation, as noted by Rosenthal and colleagues (2000), is that r² "suffers from the expository problem that important effects can appear to be smaller than they actually are in terms of practical significance" (p. 16). This was precisely the problem confronted by one of the authors when presenting the results from an experimental study using omega squared, an effect size similar to r² and recommended for use with analysis of variance (ANOVA) (Keppel & Wickens, 2004). In this study effect sizes of .03 to .07 were reported (LeCroy, 2004). To many, this range seemed "too small," yet when using a measure of strength of an effect size such as omega squared, Cohen notes a small effect is .01 and a medium effect is .06. Two problems immediately emerge from this experience. First, even if an effect size appears small, it might still have important practical implications. Second, reported effect sizes may be well within the range typically found for similar intervention studies, yet this can go unrealized because the effect size measure is not understood.
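The conversion just described can be written as a pair of one-line helper functions. This is an added Python illustration (not from the original article); the d-to-r direction assumes groups of equal size.

import math

def r_to_d(r):
    # d = 2r / sqrt(1 - r^2)
    return 2 * r / math.sqrt(1 - r ** 2)

def d_to_r(d):
    # r = d / sqrt(d^2 + 4); this direction assumes equal group sizes
    return d / math.sqrt(d ** 2 + 4)

print(round(r_to_d(0.5), 2))   # 1.15, matching the example in the text
print(round(d_to_r(1.15), 2))  # 0.5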

Table 1: A Summary of Various Effect Sizes and How to Interpret Them

Cohen's d: A measure of the magnitude of the difference between the treatment and control group means, expressed in standardized score units. Most common method for reporting effect sizes. Can be difficult to interpret because the range is beyond 0-1. This is the effect size commonly used in meta-analysis.

Point-biserial r: Computed as the correlation between a dichotomous independent variable and a continuous dependent variable. Effects appear to be smaller than they actually are because the point-biserial r is sometimes mistakenly interpreted as having the same metric as Cohen's d.

r²: Index of goodness of prediction (the proportion of variation among outcome scores that is attributable to variation in predictor scores). Effects can appear to be smaller than they actually are in terms of practical importance.

Omega squared: A relative measure of strength reflecting the proportional amount of the total population variance accounted for by the experimental treatments. Often referred to as the proportion of variation "explained" by the treatment manipulation in an experiment. Like r², effects can be interpreted as smaller than they actually are in terms of importance.

Binomial Effect Size Display (BESD): Recasts effect size or treatment magnitude as a 2 x 2 contingency table. Creates a display of the practical importance of the effect, allowing for easy interpretation. Can be used in addition to other effect size measures to aid interpretation.



A critical problem with using a common effect size measure such as Cohen's d is that it is sometimes interpreted mechanically. For example, Cohen's guidelines define d = .2 as a small effect. Even small effects, however, may be practically important. This is why additional methods of displaying effect sizes can assist the researcher in interpreting the meaningfulness of her or his findings.

IMPROVING EFFECT SIZE INTERPRETATIONS: THE BINOMIAL EFFECT SIZE DISPLAY

As recommended by Rosenthal et al. (2000), the use of the r-based Binomial Effect Size Display (BESD) avoids the problem of an effect size such as r² that might be misinterpreted because it seems small. The r-based BESD is a 2 x 2 contingency table where the rows represent the values of the dichotomous independent variable (for example, treatment and control) and the columns represent the dependent variable displayed as a dichotomous outcome (for example, improved compared with not improved). A continuous dependent measure can be presented in dichotomous categories for this purpose. The row and column totals are set to 100. The r-based BESD illustrates the difference in treatment success if one-half of the population received one condition and one-half received the other condition. The BESD assumes a 50% base rate for both experimental and control groups; this is an assumption created to illustrate the impact of the effect. Thus, the question the r-based BESD answers is: "What would the correlationally equivalent effect of the treatment be if 50% of the participants had the occurrence and 50% did not, assuming groups of equal size?" If there were no difference in effect among the two categories of the independent variable (that is, point-biserial r = 0), each cell in the table would be equal to 50%. To calculate a BESD with marginals of 100, with the point-biserial r for two groups of equal size and with equal variances, the following formulas are used (Rosenthal et al., 2000): treatment group success rate = .50 + r/2 and control group success rate = .50 - r/2.

We illustrate calculation of the BESD using summary statistics from LeCroy (2004). This study examined body image as an outcome, where the impact of a gender-specific intervention was compared with a no-treatment control group. The reported effect size from this study was .05 (omega squared), again often interpreted as minimal. The t test statistic was used to obtain a point-biserial r using the following formula:

r = sqrt(t² / (t² + df)) = sqrt((2.455)² / ((2.455)² + 116)) = .22


The point-biserial r is then inserted into each of the following equations to create the 2 x 2 BESD:

0.50 + r/2 = 0.50 + (0.22/2) = 61%
0.50 - r/2 = 0.50 - (0.22/2) = 39%

The use of the BESD allows the researcher to answer the question: What is the effect of the treatment on success? In this particular study, the success rate changed from 39% in the no-treatment control group to 61% in the gender-specific intervention group, a difference in rate of improvement of 22%. Note that the rate of improvement in the treatment group compared with the control group (61% - 39% = 22%) corresponds to r x 100, or the point-biserial correlation of .22. What is useful about the BESD is that it provides the difference between success rates, whereas r² as a measure of the strength of the effect size correlation is not very intuitive. Using the BESD provides an easy way to better understand the data. Indeed, telling the audience that the experiment had an omega-squared effect size of .05 does not convey much, whereas a claim that the treatment increased the success rate from 39% to 61% is much more meaningful. The BESD can be an important asset because it increases understanding, interpretability, and comparability, three guidelines for making statistics meaningful (May, 2004). The BESD, like most statistical methods, also has limitations (Thompson & Schumacker, 1997).
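The same calculation can be reproduced in a few lines of Python. The sketch below is an added illustration (not from the original article) that derives the point-biserial r from the reported t statistic and degrees of freedom and recasts it as BESD success rates.

import math

def r_from_t(t, df):
    # Point-biserial r recovered from a t statistic: r = sqrt(t^2 / (t^2 + df))
    return math.sqrt(t ** 2 / (t ** 2 + df))

def besd_rates(r):
    # BESD success rates, assuming a 50% base rate and equal-sized groups
    return 0.50 + r / 2, 0.50 - r / 2

r = r_from_t(2.455, 116)                  # about .22, as reported above
treatment_rate, control_rate = besd_rates(r)
print(f"r = {r:.2f}; treatment = {treatment_rate:.0%}, control = {control_rate:.0%}")
# prints: r = 0.22; treatment = 61%, control = 39%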

SOME FINAL COMMENTS ABOUT INTERPRETING EFFECT SIZE MEASURES

Although the use of effect size measures is strongly recommended, the correct interpretation of effect sizes remains an important issue. For instance, what magnitude of difference represents a meaningful outcome? The importance of not overlooking small


effects is addressed by Prentice and Miller (1992), who noted in their article titled "When Small Effects Are Impressive" that even small effects can be worthy of careful consideration. For many researchers it takes experience with the substance of the research area to answer this question. Additional efforts by researchers have taken the next step and asked if the magnitude of the change produced by the treatment is clinically significant, and have developed measures to assess "clinical significance" and "reliable change" (see Jacobson & Truax, 1991; Ogles, Lambert, & Masters, 1996; Wilson, Becker, & Tinker, 1997). An important goal promoted in this column is greater discussion of effect sizes, the practical importance of the findings, and how the findings compare with those of other similar studies. Rarely do we observe researchers discuss an effect size obtained or reviewed in other studies. Two issues bear consideration: (1) when there is little manipulation of the independent variable and (2) when the independent variable is difficult to influence. Researchers often fail to consider these points when comparing studies. If study A obtains an effect size of .65 comparing a treatment group with a no-treatment control group, that is considerably different than if study B obtains the same effect size comparing a treatment group with a control-placebo group. In this situation study B is very impressive because we do not expect to see as large a difference between groups when comparing a placebo group with a treatment group. How big does an effect size need to be to be considered important? In most cases, the effect size does not need to be as big as many would believe. Rosenthal and colleagues (2000) argued that the social and behavioral sciences are "doing much better than they have thought" (p. 25). In examining the "how big" question, they reviewed a study that found the regular use of aspirin can reduce the risk of heart attacks (Steering Committee of the Physicians' Health Study Group, 1988). The study was a five-year randomized trial of male physicians. Half of the group (n = 11,034) were given an aspirin every other day, and the other half (n = 11,037) were given a placebo. Interestingly, the study was ended early because of ethical considerations: it was clear that aspirin prevented heart attacks, and therefore it would not be ethical to continue the placebo control group. The statistical difference between the two groups was significant [χ²(1, N = 22,071) = 25.01, p < .000001]. As Rosenthal

suggested, however, social and behavioral scientists would not be impressed by the seemingly small effect size, that is, r = .034 and r² = .001. Yet the use of the r-based BESD provides a better framework for interpreting this result. Recast, these data show that 3.4% fewer people would experience a heart attack (aspirin heart attack group, 48.3%, compared with the placebo heart attack group, 51.7%). In this context the findings do seem important.
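As an added numerical check (not part of the original article), the same recasting can be verified from the reported chi-square, using the standard relationship r = sqrt(chi-square / N) for a 2 x 2 table with one degree of freedom.

import math

chi_square = 25.01
n = 22071

r = math.sqrt(chi_square / n)   # effect size correlation for a 2 x 2 table, about .034
placebo_rate = 0.50 + r / 2     # share of heart attacks in the placebo group, about 51.7%
aspirin_rate = 0.50 - r / 2     # share of heart attacks in the aspirin group, about 48.3%
print(f"r = {r:.3f}; placebo = {placebo_rate:.1%}, aspirin = {aspirin_rate:.1%}")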

CONCLUSIONS AND RECOMMENDATIONS

The use of effect size measures in conjunction with statistical significance testing is an important consideration, especially in social work, where underpowered intervention studies are more common. By reporting effect sizes, investigators can facilitate subsequent meta-analyses, help future researchers establish outcome expectations from similar research, and better understand how a study's findings fit into the existing research literature. We recommend the following procedures in analyzing results from intervention research:

• The results of intervention research should include presentation and interpretation of both statistical significance and effect sizes.
• Regardless of the statistical methods used (t test, ANOVA, or multivariate procedures), Cohen's d should be presented in the results.
• In addition to Cohen's d, the r-based BESD should be calculated to further interpretation of effect size results.
• Results should be interpreted with an understanding of what effects other similar studies have found.

REFERENCES

Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Lawrence Erlbaum.

American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.

American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.

Baugh, F. (2002). Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62, 254-263.

Claiborne, N. (2006). Efficiency of a care coordination model: A randomized study with stroke patients. Research on Social Work Practice, 16, 57-66.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-1312.



Dunst, C. J., Hamby, D. W., & Trivette, C. M. (2004). Guidelines for calculating effect sizes for practice-based research syntheses. Centerscope: Evidence-Based Approaches to Early Childhood Development, 3, 1-10.

Engelhardt, J. B., Toseland, R. W., Gao, J., & Banks, S. (2006). Long-term effects of outpatient geriatric evaluation and management on health care utilization, cost, and survival. Research on Social Work Practice, 16, 20-27.

Fidler, F., Cumming, G., Thomason, N., Pannuzzo, D., Smith, J., Fyffe, P., et al. (2005). Toward improved statistical reporting in the Journal of Consulting and Clinical Psychology. Journal of Consulting and Clinical Psychology, 73, 136-143.

Harris, K. (2003). Journal of Educational Psychology: Instructions for authors. Journal of Educational Psychology, 95, 201.

Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12-19.

Keppel, G., & Wickens, T. D. (2004). Design and analysis: A researcher's handbook. Englewood Cliffs, NJ: Prentice Hall.

Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R., Donahue, B., et al. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA and ANCOVA analyses. Review of Educational Research, 68, 350-386.

Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
