Effect Size Estimation: Factors to Consider and Mistakes ... - CiteSeerX

26 downloads 158 Views 117KB Size Report
College of Business Administration, University of Missouri-St. Louis, St. Louis, .... wellness program given its use only accounts for a small percent of variance in ...
Journal of Management 2003 29(1) 79–97

Effect Size Estimation: Factors to Consider and Mistakes to Avoid James A. Breaugh∗ College of Business Administration, University of Missouri-St. Louis, St. Louis, MO 63121, USA Received 19 June 1999; received in revised form 25 January 2002; accepted 15 March 2002

In recent years, there has been an increase in the reporting of effect size information. This paper (a) provides a review of commonly used effect size indices, (b) highlights some common misconceptions about effect size estimates, and (c) introduces a number of infrequently used effect size measures which depending upon the research context and the audience may better communicate the importance of the relationship between two variables. © 2002 Elsevier Science Inc. All rights reserved.

In conducting a study, a researcher generally is interested in estimating how two or more variables (e.g., number of hours worked, work/family conflict) are related in a given population (e.g., working spouses with dependents). Because a researcher rarely has access to the entire population of interest, a sample of this population is studied. Given such sampling, a null hypothesis significance test is used to estimate the probability that a relationship found between variables is not due to chance. Researchers frequently have used the results of a significance test to assess whether their findings are important. Although the value of significance testing continues to be debated, most experts (e.g., Kirk, 1996) believe it is inappropriate to make a judgment about the importance of a relationship between variables based upon the result of a significance test (with a large sample, almost any relationship will be statistically significant). At least partly in response to criticisms of significance testing, the reporting of effect size information has been increasing (Rosenthal, Rosnow & Rubin, 2000). Although this increase is generally seen as desirable, the simplistic manner in which effect size estimates have been interpreted is cause for concern. It is not uncommon for researchers to interpret effect sizes in ways that may impede scientific progress (Fichman, 1999) and may diminish the importance attached to research by the general public (Rosenthal, 1994). The goals of ∗

Tel.: +1-314-516-6287; fax: +1-314-516-6420. E-mail address: [email protected] (J.A. Breaugh). 0149-2063/02/$ – see front matter © 2002 Elsevier Science Inc. All rights reserved. PII: S 0 1 4 9 - 2 0 6 3 ( 0 2 ) 0 0 2 2 1 - 0

80

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

this paper are (a) to provide a review of commonly used effect size indices, (b) to highlight some common misconceptions about effect size estimates, and (c) to introduce a number of infrequently used effect size measures which may help researchers make sense of their findings and help them better communicate these findings to their audience. Given our focus is conceptual rather than computational, we will be selective in providing formulas. We recognize that much of the information presented in this paper will not be new to readers with considerable statistical expertise. However, we believe that many readers will find the information to be valuable.

Effect Size Indices: Some Basic Issues Cohen defined an effect size as “ ‘the degree to which the phenomenon is present in the population’ or ‘the degree to which the null hypothesis is false’ ” (1988: 9–10). Given that a researcher rarely has data concerning the entire population of interest, as typically used, an effect size refers to a sample-based estimate of the size of the relationship between variables (Rosenthal, 1994). Although there are several effect size indices, authors (e.g., Kirk, 1996; Richardson, 1996) frequently have categorized them as being of one of two types—measures of the standardized difference between group means and measures of explained variance. We believe this categorization is not particularly useful and sometimes is misleading. To understand our position, it is useful to briefly describe these two categories of effect sizes. Effect Size Indices that Reflect the Standardized Difference Between Group Means Effect size indices that reflect a standardized difference between group means (e.g., Cohen’s d, Glass’s ∆, Hedges’s g) generally are used for assessing the degree of association between an independent variable and a dependent variable in an experiment (readers interested in how d, ∆, and g differ computationally should refer to Rosenthal, 1994). The most commonly used of these indices is d (Cohen, 1988) which reflects the difference between two group means divided by their pooled within-group standard deviation. For example, if an experimental group and a control group differed by an average of two points on a dependent variable and the pooled within-group standard deviation was five, then d would be .40 (i.e., the groups differed by .40 standard deviation units). By standardizing on the dependent variable, d allows a comparison of effect sizes across studies that used dependent variables measured on different scales. Effect Size Indices that Reflect the Percent of Variance Accounted for Variance accounted for measures such as r2 and η2 (eta-squared) involve a proportion that reflects how much variability in one variable is associated with variation in a second variable. Frequently, researchers have estimated the percent of variance shared between two variables by computing the correlation between them and squaring it. Measures of explained variance standardize both variables being examined. This dual standardization permits cross-study comparisons of the degree of association between predictor and criterion variables that have different metrics. For example, one can compare the variance accounted

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

81

for in job performance by cognitive ability in two studies that used different measures of both variables. A limitation of standardizing the predictor variable is that it confounds “differences in variance between studies with differences in effect sizes” (Judd, McClelland & Culhane, 1995: 439). Standardized Mean Difference and Variance Accounted for Measures: Additional Commentary Although standardized mean difference measures (e.g., d) and variance accounted for measures (e.g., r2 ) have been presented as representing different effect size indices, this treatment is misleading. In fact, d and r are algebraic transformations of each other {r = d/(d 2 + 4)1/2 } when the predictor variable has two values (Cortina & Nouri, 2000). Although d and r2 are commonly reported effect size indices, they have been criticized for doing a poor job of conveying the strength of the relationship between variables (Abelson, 1995). For example, d reflects the degree of relationship between two variables in terms of standard deviation units. If a criterion variable reflects certain phenomena (e.g., dollars), it may not be difficult to interpret a d-value (e.g., if a standard deviation reflects US$ 200, then a d-value of .40 represents a group difference of US$ 80). However, for many criterion variables (e.g., job stress), a difference reported in standard deviation units may not be easy to interpret. Explained variance measures also have been criticized for being poor communication devices. This communication problem stems from the fact that, “because r2 (and all other indices of percent variance accounted for) are related only in a very nonlinear way to the magnitudes of effect sizes that determine their impact in the real world” (Hunter & Schmidt, 1990: 199), such indices commonly result in a researcher underestimating the importance of a relationship (Rosenthal et al., 2000, also addressed this issue). Such underestimating can have important consequences for theory development (e.g., a researcher inappropriately drops a key variable from a model) and practice (e.g., a firm discontinues a wellness program given its use only accounts for a small percent of variance in absenteeism). To demonstrate how a measure of explained variance can lead to underestimation of the importance of the relationship between two variables, consider the case of the heart bypass surgery mortality rates which were reported for two Philadelphia hospitals (Anders, 1996). The correlation between the choice of a hospital and survival from surgery was .07. Given that the choice of hospital accounted for less than 1% of the variance in survival rate, some might see the choice as being unimportant for a patient facing bypass surgery. But is it? The mortality rate at one hospital was 3.60%; the mortality rate at the other hospital was 1.40%. Thus, the death rate was 257% higher at one hospital than the other (alternative methods for assessing the magnitude of a relationship between dichotomous variables will be addressed shortly). Although the point exemplified in the hospital example is not new, researchers frequently have overlooked the fact that variance accounted for measures can result in undervaluing the theoretical and practical significance of the relationship between variables. Given this fact, a few authors (Bobko, 1995) have suggested that researchers should report a correlation coefficient as a measure of effect size. 
Alleged advantages of r are that it is a more familiar statistical value than d and that it is a bounded index (d has no fixed range) which may make interpretation easier. Although in some situations these may be advantages of using

82

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

r as an effect size instead of d or r2 , these are not always real advantages. For example, research has shown that individuals are not particularly good at evaluating how r reflects the degree of association between two variables (Oakes, 1986). Similarly, although in theory a correlation coefficient is bounded by values of ±1.00, as will be discussed, in many studies the maximum possible range of r may be much less.

A Different Perspective on Effect Sizes Rather than categorizing effect size indices as either being standardized measures of mean differences or measures of variance accounted for, it may make more sense to view them from a different perspective. For example, distinctions can be made concerning whether or not the independent variable and/or the dependent variable are continuous variables. Given that we are not familiar with an ideal taxonomy of effect sizes, we believe that it makes the most sense to start with the simplest case and progress to more complex ones. The Case of a Dichotomous Independent Variable and Dichotomous Dependent Variable The heart bypass surgery study discussed earlier is an example of a research design that involved dichotomous independent (Hospital A vs. Hospital B) and dependent (died during surgery vs. survived) variables. Frequently, in analyzing such data, a researcher will compute a correlation coefficient (a phi correlation) or a chi-square value. Although sometimes viewed as distinct tests, a phi coefficient and a chi-square value are algebraic transformations of one another (phi = {χ 2 /N }1/2 ). Given readers are likely to be more familiar with interpreting the magnitude of a correlation coefficient than a chi-square value, it makes sense for a researcher to focus on a phi coefficient in discussing the degree of relationship between two dichotomous variables. A limitation of phi as a measure of effect size is that its possible range is affected by the distributions of the variables (McDonald, 1999). Unless the two variables have the same marginal distributions, phi cannot equal either ±1.00. This is due to the fact that the diagonal cells in a 2 × 2 matrix cannot both have values of zero unless the marginal distributions are identical. When the marginals differ greatly, the maximum possible phi can be much less than ±1.00. Tables 1 and 2 present data demonstrating this fact. The data in Table 1 show that for a hypothetical sample the correlation between employee gender and quitting is .20 (p < .05). If a researcher were to compare this correlation against a correlational value of ±1.00, he/she might conclude that the relationship between gender and turnover was modest. However, what if one compared the phi coefficient of .20 against the maximum possible correlation given the marginal distributions of the data? Given the marginal distributions (50:50 vs. 90:10), the maximum possible phi is .33 (the formula for computing the maximum possible phi is provided in Table 2). In summary, how one evaluates the strength of the relationship between gender and turnover may differ depending upon whether one uses .33 or 1.00 as a comparison standard. In certain disciplines (e.g., medicine), it is common for researchers to report a risk ratio as an effect size measure (a ratio of 1.00 reflects equal risk). In the bypass surgery study, the

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

83

Table 1 The relationship of employee gender and voluntary turnover Stayed (nc1 = 180, pc1 = .90)

Quit (nc2 = 20, pc2 = .10)

Male (nr1 = 100, pr1 = .50)

n11 = 84 p11 = .42

n12 = 16 p12 = .08

Female (nr2 = 100, pr2 = .50)

n21 = 96 p21 = .48

n22 = 4 p22 = .02

Note 1: nc1 stands for sample size of column 1; nr1 stands for sample size of row 1; pc1 stands for the probability of being in column 1; pr1 stands for the probability of being in row 1. Note 2: Statistical values: χ 2 = 8.00 (p < .05). Phi correlation = .20. Risk ratio for a male quitting = 4.00 (.08/.02). Odds of a male quitting = .1905. Odds of a female quitting = .0417. Male/females odds ratio for quitting = 4.57.

risk ratio of the two hospitals for dying during surgery was 2.57 (3.60%/1.40%). Although a risk ratio is easy to understand, it has limitations. One of these limitations (Fleiss, 1994, discusses others) is that a risk ratio is not symmetric. That is, if instead of focusing upon those who died during surgery one focused on those who survived, a researcher would come up with a risk ratio of 1.02 (98.6%/96.4%) which conveys a different impression. Instead of reporting a risk ratio or phi coefficient as an effect size, many statisticians (e.g., Haddock, Rindskopf & Shadish, 1998; Pampel, 2000) have advocated reporting an odds ratio. A desirable property of an odds ratio “is that its possible range of values is not influenced by the marginal distributions of the variables” (Rudas, 1998: 10). Haddock et al. (1998) provided several examples of how the marginal distribution of variables can affect the magnitude of phi coefficients. One can compute an odds ratio from probability values (odds ratio = p11 p22 /p12 p22 ) or from frequency data (odds ratio = n11 n22 /n12 n22 ). However, to better understand the meaning of an odds ratio, the use of a different formula (odds ratio = (n11 /n12 )/(n21 /n22 )) is instructive. In this context, odds (e.g., n11 /n12 ) indicate how often something occurs relative to how often it does not occur. For example, for

Table 2 The maximum possible relationship of employee gender and voluntary turnover given the marginal distributions of the data in Table 1 Stayed (nc1 = 180, pc1 = .90)

Quit (nc2 = 20, pc2 = .10)

Male (nr1 = 100, pr1 = .50)

n11 = 80 p11 = .40

n12 = 20 p12 = .10

Female (nr2 = 100, pr2 = .50)

n21 = 100 p21 = .50

n22 = 0 p22 = .00

Note 1: nc1 stands for sample size of column 1; nr1 stands for sample size of row 1; pc1 stands for the probability of being in column 1; pr1 stands for the probability of being in row 1. Note 2: Statistical values: χ 2 = 22.22 (p < .05). Phi correlation = .33. It is not possible to compute a risk ratio for a male quitting an odds ratio for a male quitting given a zero value in cell 22. Note 3: Formula for maximum possible phi (from McDonald, 1999). Maximum possible phi = {(pr1 )(1 − pc1 )/(pc1 )(1 − pr1 )}1/2 . For these data, maximum phi equals .33{(.50)(1 − .90)/(.90)(1 − .50)}1/2 . Data must be arranged so that pr1 < pc1 .

84

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

the data reported for males in Table 1, the odds of a male quitting relative to staying are .19 (16/84). For females, the odds of quitting relative to staying are .04 (4/96). Therefore, the male/female odds ratio for the data reported in Table 1 is 4.57 (if we had chosen to divide the female odds by the male odds, the odds ratio would have been .22). An odds ratio can equal any nonnegative number. Given that an odds ratio of 1.00 reflects that there is no relationship between two variables, the farther the odds ratio is from 1.00, the stronger the association between variables. Because an odds ratio is symmetric around 1.00, odds ratios of 5.00 and .20 reflect equivalently strong relationships (i.e., one value is five times 1.00; the other is 1/5). A limitation of an odds ratio is that it cannot be computed when one of the cell values equals zero. In such cases, it is recommended that a value of .5 be added to each cell value. Another limitation of an odds ratio is that compared to a phi coefficient and a risk ratio, its meaning “is not intuitively clear” (Fleiss, 1994, p. 251). In this regard, Pedhazur (1997) provided examples of odds ratios being incorrectly interpreted as probabilities in published studies. In order to address the issue of how to interpret an odds ratio, some researchers have provided guidelines for interpreting the magnitude of an odds ratio (e.g., “As general rules of thumb, odds ratios close to 1.0 represent a weak relationship between variables, whereas odds ratios over 3.0 for positive associations (less than one-third for negative associations) indicate strong relationships,” Haddock et al., 1998: 342). According to these guidelines, the odds ratio reported in Table 1 between gender and turnover represents a “strong” relationship. Given that for commonly used effect size measures such as d and r a value of zero reflects no association between variables, some authors (e.g., Haddock et al., 1998) have suggested that the fact that an odds ratio of 1.00 reflects no association may cause confusion. To address this issue, some experts recommend reporting a log odds ratio which is derived by first computing the natural logarithms of the odds values. For a log odds ratio, a zero value reflects no association between variables. A log odds ratio has upper and lower bounds of infinity. Although a log odds ratio addresses the issue of zero reflecting no association, one can question whether introducing logarithmically transformed variables really aids interpretation. In this section, we discussed a number of effect size indices. Although some experts advocate the reporting of an odds ratio or a log odds ratio, these ratios are rarely provided. This may be partly due to organizational researchers being unfamiliar with them. However, it also is likely due to researchers believing that an odds ratio or a log odds ratio does not adequately convey the strength of a relationship to many readers. In contrast to an odds ratio, the reporting of a phi coefficient is more common. As noted, a weakness of a phi coefficient is that its possible range can be substantially restricted if the marginal distributions of variables are unequal. A way to address this issue is for a researcher also to report the maximum possible phi value. Of the effect size indices discussed, a risk ratio is the easiest to understand. At this point, it seems appropriate to offer the following recommendations. If possible, it is advantageous to provide a 2 × 2 contingency table such as Table 1. 
Doing this allows a reader to compute a risk ratio, a phi coefficient, or an odds ratio. Alternatively, a researcher should consider reporting more than one effect size measure. When space constraints do not allow for either of these options, a researcher should consider the sophistication of his or her audience in choosing a single effect size to report.

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

85

The Case of a Dichotomous Independent Variable and a Continuous Dependent Variable When a study involves a dichotomous independent variable and a continuous dependent variable, researchers frequently have reported Cohen’s d (for an experiment), a correlation coefficient (i.e., a point-biserial correlation), or a squared correlation as a measure of effect size. Given these effect size indices already have been discussed in some detail, it suffices to state that each is a technically sound measure with strengths and weaknesses. An alternative to reporting d, r, or r2 is reporting an unstandardized regression coefficient. An unstandardized regression coefficient indicates the expected difference in the dependent variable given a unit change in the independent variable (as noted by a reviewer of this paper, when coded in a traditional manner, the unstandardized regression weight reflects the difference in means between the two groups). Judge, Cable, Boudreau and Bretz (1995) used unstandardized regression weights “to illustrate the practical effects of the predictors of compensation” (p. 501). Given a number of their predictor variables were dichotomous and not equally distributed (e.g., women comprised 7% of the sample; 1% of the sample held a position on a board of directors), it was not surprising that some of these predictors were not strongly associated with compensation. For example, gender correlated .20 with compensation and board membership correlated .11. In terms of variance accounted for in compensation, 4% and 1% may not be impressive figures. However, Judge et al. used the regression weights for these variables to demonstrate their effect size. These authors showed that being male was linked to an extra US$ 6575 in annual compensation and that being on a board of directors was associated with an additional US$ 41,772 in compensation. From this study, the value of using regression coefficients to portray effect size information is apparent. A reader may be uncertain how to interpret the fact that gender accounted for 4% of the variance in compensation. It is easy to understand a US$ 6575 difference in compensation. The use of unstandardized regression coefficients is not without its limitations. For example, Hunter and Schmidt (1990) noted that coefficients are difficult to cumulate for meta-analysis purposes, and results can be difficult to interpret for variables that lack the intrinsic interpretability of a criterion such as dollars. Judd, McClelland and Culhane (1995) cited the difficulty of using a regression coefficient as an effect size measure when analyzing data from an experiment in which the independent variable was qualitative in nature (this issue will be addressed in more detail in a later section which introduces Abelson’s causal efficacy ratio). Given a correlation is frequently relied upon as a measure of effect size, it is important to be aware that the maximum point-biserial correlation it is possible to compute is not ±1.00. Instead, the maximum possible correlation is a function of the distribution of the dichotomous variable (Nunnally, 1978). When an equal number of persons are in each group, the maximum possible correlation is .80. If the sample split is 90–10%, the maximum correlation possible is .59. For extreme splits, the maximum correlation is quite restricted. For example, for a 95–5% split, the maximum correlation is .47; for a 99–1% split, it is .27. 
Given the maximum possible point-biserial correlation that can be derived is often much less than ±1.00, it can be informative to interpret the actual correlation in relation to that which is the maximum possible. For example, earlier it was noted that Judge et al. (1995) found the correlation between gender and compensation was .20. A reader might compare

86

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

this value against a correlation of 1.00 and conclude that these variables were not highly related. However, given women comprised only 7% of their sample, the maximum possible correlation that Judge et al. could have found was .52. Thus, instead of the correlation between gender and compensation reflecting 20% of that which was possible, it actually reflects 38% of the maximum possible correlation. In summary, in reporting an effect size for the case of a dichotomous independent variable and a continuous dependent variable, a researcher has a number of possible choices. Depending upon circumstances (e.g., space constraints, the audience), a researcher may be wise to present more than one effect size measure. In those cases in which the criterion variable reflects meaningful units (e.g., dollars, day absent), reporting regression results can be particularly useful in portraying the strength of association between variables. The Case of a Continuous Independent Variable and a Dichotomous Dependent Variable In analyzing the relationship between a continuous independent variable and a dichotomous dependent variable, researchers sometimes have used linear regression analysis. However, as noted by Long (1997), the use of regression with a dichotomous criterion violates several of the assumptions (e.g., that errors are normally distributed, homoscedasticity) underlying the use of this analytic technique and can result in a number of undesirable outcomes (e.g., nonsensical predicted scores). A particular problem with using linear regression with a binary outcome variable is that the result of a significance test, which is based on ordinary least squares estimation, will not be accurate. Instead of using a linear regression model, a logistic regression model, which uses maximum likelihood estimation, is recommended when the dependent variable is dichotomous. If the results of the logistic regression analysis demonstrate a statistically significant relationship, there are several approaches one can take to estimate the effect size of the relationship. For example, Pampel (2000) discussed the use of odds, logged odds, and probabilities (readers interested in a detailed discussion of logistic regression, including the difference between ordinary least squares estimation and maximum likelihood estimation, are referred to the writings of Long (1997), Pampel (2000), and Pedhazur (1997)). Although a detailed treatment of logistic regression is beyond the scope of this paper, the discussion of a simple example may be instructive. Table 3 presents data from a hypothetical study that investigated the relationship between mentoring and being promoted. Mentoring was coded as a continuous variable (e.g., employees reported having had 0, 1, or 2 mentors during their careers). The criterion variable reflects whether a person was promoted in the year following the gathering of the mentoring data. As is apparent from Table 3, although the overall promotion rate was 67%, promotion rates differed by mentoring condition. Although there is no universally accepted significance test for logistic regression (Long, 1997, provided an excellent discussion of commonly reported significance tests such as the likelihood ratio chi-square test and the Wald test), the data reported in Table 3 exceed a p < .01 threshold for all of the commonly used significance tests (e.g., likelihood ratio chi-square = 78.41, p < .01). 
With regard to effect size, an odds ratio can be a useful approach for making sense out of the relationship that exists between mentoring and being promoted. For the data in Table 3, the relative odds for the three mentoring conditions were: no mentor = .25 (i.e., 20% chance of being promoted/80% chance of not being

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

87

Table 3 Logistic regression example: relationship of mentoring and being promoted Promoted (nc1 = 100, pc1 = .67)

Not promoted (nc2 = 50, pc2 = .33)

No mentor (nr1 = 50, pr1 = .33)

n11 = 10 p11 = .07

n12 = 40 p12 = .27

One mentor (nr2 = 50, pr2 = .33)

n21 = 42 p21 = .28

n22 = 8 p22 = .05

Two mentors (nr3 = 50, pr3 = .33)

n31 = 48 p31 = .32

n32 = 2 p32 = .01

Note 1: nc1 stands for sample size of column 1; nr1 stands for sample size of row 1; pc1 stands for the probability of being in column 1; pr1 stands for the probability of being in row 1. Note 2: Likelihood ratio χ 2 = 78.41, p < .01. Note 3: The odds ratio for having had one mentor vs. none equals 21.0. The odds ratio for two mentors vs. none is 96.0. The odds ratio for having two mentors vs. one is 4.57.

promoted), one mentor = 5.25, and two mentors = 24.00. In terms of using an odds ratio to assess effect size, one compares the effect of mentoring by computing an odds ratio for two conditions at a time. Thus, the odds ratio for having had one mentor vs. none is 21.00 = (5.25/.25). The odds ratio for two mentors vs. none is 96.00, and the odds ratio for having two mentors vs. one is 4.57. An examination of the odds, the odds ratios, and the row probabilities for the data reported in Table 3 makes apparent that an individual is much less likely to be promoted if the person has not had at least one mentor. With logistic regression, the regression coefficient also can be interpreted as a measure of effect size. The regression coefficient reflects the expected change in the log of the odds associated with a one unit change in the independent variable (Pedhazur, 1997). In examining the relationship between a continuous independent variable and a dichotomous criterion, researchers frequently have computed a point-biserial correlation. Although the effect size between two variables reflected by a point-biserial correlation is accurate, the significance test for this correlation is imprecise, given that this correlational approach is based upon the same assumptions as the simple linear regression model (Pampel, 2000). One approach to addressing this issue is to use a more conservative p-value for the significance test. An alternative approach is to use a significance test (e.g., the Wald test) which is based upon maximum likelihood (ML) estimation rather than ordinary least squares estimation. However, the use of ML estimation is not always as straightforward as some researchers have suggested. For example, as noted by Long, “it is risky to use ML with samples smaller than 100, while samples of 500 seem adequate” (1997: 54) and an even larger sample is needed when “there is little variation in the dependent variable (e.g., nearly all of the outcomes are 1)”. In summary, with less than a fairly large sample, the p-value derived from a significance test based upon ordinary least squares estimation or maximum likelihood estimation should be consider an approximation. With regard to estimating an effect size (e.g., comparing the derived correlation to the maximum possible correlation), the same comments made for analyzing data for a dichotomous independent variable and a continuous dependent variable also hold for a continuous independent variable and a dichotomous dependent variable.

88

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

The Case of a Continuous Independent Variable and a Continuous Dependent Variable Given that computing Cohen’s d would require dichotomizing a continuous variable, it should not be used for estimating the effect size between two continuous variables (dichotomizing continuous variables can result in a researcher underestimating the strength of the relationship between variables). Many of the other effect size indices discussed are appropriate for use with continuous variables. If a researcher decides to report r or r2 as an effect size, he or she should be aware that, unless the two continuous variables have equivalent distributions, the maximum possible correlation between them cannot equal ±1.00. Carroll (1961) has provided a formula for computing the maximum correlation possible given the distribution of the data. Fortunately, unless the two distributions differ greatly, the maximum correlation will not be greatly restricted. In estimating the relationship between two continuous variables, researchers frequently have not examined if the relationship is linear or close to linear. If the relationship is not linear, using r, r2 , or linear regression can result in a misleading effect size estimate. Although one can check for non-linearity by adding predictors to a linear regression equation (e.g., testing whether adding a polynomial term to an otherwise linear equation significantly increases the R-squared value), a simpler approach is to compute an eta-coefficient (η). Eta-squared reflects the variance explained in the criterion variable by the predictor variable divided by the total criterion variance (Nunnally, 1978). Given η reflects the square root of this value, it can take on any value from .00 to 1.00. When the relationship between two variables is linear, the value derived for η is identical to that for r. Although η has the desirable property of accurately reflecting the strength of the relationship between two variables when they are not linearly related, it also has limitations. For example, η does not indicate what the form of the relationship is. Alternatively, when one questions the linearity of a relationship, logistic regression can be used. Conversely, as noted by a reviewer of this paper, there may be cases where ignoring non-linearity in one’s data may be appropriate (e.g., when theory suggests a linear relationship). Other Combinations of Independent and Dependent Variables Although we discussed four common research designs (e.g., two dichotomous variables), there are several other designs that space constraints do not allow us to address. We encourage readers to refer to the many sources cited (e.g., Rosenthal et al., 2000) to further their knowledge concerning research designs we did not cover. However, we will address two situations (i.e., inappropriate reliance on an overall effect size measure, a design involving two or more independent variables) in which effect size information is commonly misused. Consider an experiment in which the independent variable was goal setting (assigned goal, self-set goal, no goal) and the dependent variable was task performance. In analyzing such data, a researcher generally will conduct a significance test. If the null hypothesis is rejected, an overall measure of effect size is commonly computed. An eta-squared coefficient and other similar measures (e.g., omega-squared) reflect the overall effect of goal setting (i.e., variance due to treatment divided by total variance). 
Eta-squared does not reflect how specific conditions differed (e.g., self-set vs. assigned goals). To make a comparison between

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

89

two conditions, a researcher needs to derive an effect size for the two conditions of interest. Unfortunately, researchers sometimes have not done this. An experimental design highlights another complicating factor in deriving an effect size. Consider a study in which a researcher examined the effects of task variety (high vs. low) and performance feedback (specific vs. general) on job satisfaction. It is common for a researcher to report an effect size for each of the independent variables. In the study described, reporting an effect size such as d or η2 could be misleading. This is due to the fact that manipulating each of the independent variables contributes to the total variance for job satisfaction (i.e., if only one of the independent variables had been included in the study, the total variance would likely be smaller). Generally, in a study involving more than one independent variable, a researcher is interested in estimating the effect size for that variable with the effects of the other independent variables removed (Rosenthal et al., 2000). For example, if in estimating the effects of task variety on job satisfaction a researcher did not control for the effects of performance feedback, the researcher would overestimate the pooled variance and thus underestimate the effect size for task variety (for a more detailed discussion of this issue, see Cortina & Nouri, 2000, pp. 15–17). Thus, in situations such as that described, a researcher should not report an effect size without first controlling for the effects of other manipulated variables. Although rarely reported, effect size measures that remove variance due to the effects of other manipulated variables (e.g., partial η2 , a d-value derived with other variables controlled) do exist. Controlling for the effects of other independent variables in estimating an effect size for a given independent variable is a fairly straightforward matter in an experiment. However, this is not the case for a study in which the independent variables were not manipulated. This is due to the fact that in a non-experimental study the independent variables are generally correlated. The topic of partitioning variance in the dependent variable due to the effects of two or more correlated variables is complex, and a discussion of this issue is beyond the scope of this paper (Pedhazur, 1997, provides an excellent discussion of this topic). For our purposes, it suffices to say that, although it would be desirable if a technique such as regression analysis enabled one to partition unique variance in the criterion variable to the various independent variables, most experts agree that there is no way to accurately do this (“It would be better to concede that the notion of ‘independent contribution to variance’ has no meaning when predictor variables are intercorrelated”, Darlington, 1968: 169).

Alternative Approaches for Estimating Effect Size Although there a number of established effect size measures (e.g., r, d, η, odds ratio), for a variety of reasons (e.g., to address interpretation problems which have arisen with commonly used measures), a few researchers have advocated the reporting of some less commonly utilized effect size indices. We will discuss three of these measures. Abelson’s (1995) Causal Efficacy Ratio as an Effect Size Index As noted by Prentice and Miller (1992), in an experiment, an effect size is partially a function of the strength of the manipulation of the independent variable. Given this fact,

90

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

Abelson found it “extraordinary that the causal variable has been largely ignored in the treatment of effect size” (1995: 50). In order to take into consideration the strength of an experimental manipulation, Abelson advocated calculating a “causal efficacy ratio” as a measure of effect size. This ratio reflects the “raw effect size” (i.e., the difference between the experimental group and the control group on the dependent variable) divided by the “cause size” (i.e., the difference between the experimental group and the control group on the independent variable). When the independent variable is objectively quantifiable (e.g., a different number of hours of training), the causal efficacy ratio is equivalent to the regression slope. However, when the independent variable is qualitative in nature, which is frequently the case, the causal efficacy ratio provides different information than that gleaned from the use of regression with dummy or contrast coding (i.e., such coding does not consider how far apart the experimental and the control groups are on the construct underlying the manipulation). In the case of a qualitative independent variable, Abelson suggested using the difference between the manipulation check scores of the experimental and control groups as the measure of cause size. Thus, if two groups differed by four scale points on a manipulation check (e.g., amount of job autonomy) and the groups differed by eight points on a dependent variable (e.g., number of widgets produced), the causal efficacy ratio would be 2.0 (i.e., every scale point on the independent variable resulted in a two-point difference on the dependent variable). If two groups differed by two scale points on the manipulation check and they differed by four points on the dependent variable, the causal efficacy ratio also would be 2.0. The value of Abelson’s causal efficacy ratio is that it forces a consideration of the relative strength of the independent variable. For example, for the job autonomy data presented, frequently, a researcher will focus upon the size of the group difference on the dependent variable. Such a focus might result in a researcher concluding that autonomy had twice the effect in the first study (i.e., an eight widget difference vs. a four widget difference). This difference in magnitude could lead to future research which looked for moderator variables. However, if a researcher had computed causal efficacy ratios, he or she should conclude that it was the difference in the strength of the manipulation that resulted in the outcome difference. In order to make a comparison such as that described, it is important that the same scale be used as a manipulation check across studies. Although to date studies have rarely reported Abelson’s causal efficacy ratio, as researchers such as Tryon (2001) publicize its potential, it is likely to become more commonly reported. Common Language Effect Size Statistics McGraw and Wong (1992) introduced a common language (CL) effect size statistic which converts an effect size into a probability. According to these authors, the “primary value of the proposed statistic is that it is better than available alternatives for communicating effect size to audiences untutored in statistics” (p. 361). For comparing two groups, McGraw and Wong’s CL index represents the probability that a score randomly sampled from one distribution will be larger than a score randomly sampled from a second distribution. 
To demonstrate the value of their CL index, McGraw and Wong used data on the heights of men and women (they also provided examples based on other variables). In their sample, males averaged 69.7 in. with a standard deviation of 2.8 in.; women averaged 64.3 in. with a

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

91

standard deviation of 2.6 in. The CL index for these data was .92 (the probability of a male being taller than a female is 92%). McGraw and Wong also reported more commonly used effect size indices (i.e., d = 2.00 and r = .71). However, they argued that the probability value of 92% was likely to be more easily understood by most individuals than were the values for d and r. Given that many readers may be unfamiliar with the CL index, quoting a terse description of how to compute it may be worthwhile (interested readers are referred to McGraw and Wong’s article for a detailed discussion and other examples). “In the height example, CL is equal to the probability of obtaining a male-minus-female height score greater than zero from a normal distribution with a mean of 5.4 in. (the difference between the male sample mean of 69.7 in. and the female mean of 64.3 in.) and a standard deviation of 3.82 in. (the root sum of 7.84 and 6.76, the male and female variances). This probability corresponds to the probability of a standardized difference score greater than −1.41, which is the standardized score corresponding to a difference score of 0 in the distribution of male–female height differences, z = (0–5.4)/3.82. The upper tail probability associated with this value, p = .92, corresponds to CL and can be calculated using the normal curve” (McGraw & Wong, p. 361). Although in introducing their CL index McGraw and Wong focused on an example which involved two groups (i.e., males vs. females) and a continuous dependent variable, they also provided variations of their effect size measure which are applicable to other research designs (e.g., when there are more than two groups, when the dependent variable is discrete). In addition, they provided evidence that their CL index is robust to violations of normality. Given their belief that McGraw and Wong had provided “an appealing index of effect size that requires no prior knowledge of statistics to understand” (p. 509), Dunlap (1994) offered an extension of the CL index which applies to bivariate normal correlations (McGraw and Wong’s CL index only applied to discrete groups). Dunlap’s CLR index reflects the probability that, if an individual is above the mean on one variable, he or she will be above the mean on the second variable. For example, according to data presented by Dunlap, if the correlation between the heights of fathers and sons was .40, then there is a 63% probability that a father who was above average in height would have a son of above average height. A Utility Analysis Approach to Demonstrating an Effect Size In 1985, McEvoy and Cascio (1985) published a widely cited paper which used metaanalysis to examine the effects on employee turnover of providing a realistic job preview (RJP) to job applicants. They summarized their findings as follows: “Given the low correlation (phi = .09) found in this meta-analysis between RJPs and reduction in turnover, managers might do well to look elsewhere when seeking turnover reduction strategies” (p. 351). Based upon the results of their own meta-analysis, Premack and Wanous (1985) also estimated that an RJP accounted for less that 1% of the variance in employee turnover. However, they presented a different picture of RJP effectiveness. Premack and Wanous used a utility analysis approach to estimate the cost savings (e.g., lower recruitment, selection, and training costs) that would result from the use of an RJP. 
For a situation involving a department with 100 employees and an annual turnover rate of 50%, Premack and Wanous

92

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

estimated the use of an RJP would result in an annual employee replacement cost reduction of US$ 58,800. In summary, using a smaller RJP effect size estimate than McEvoy and Cascio (a phi of .06 vs. .09), based upon the results of a utility analysis, Premack and Wanous came to a very different conclusion concerning the likely value of using an RJP. Phillips (1998) recently reported the results of a meta-analysis that replicated and extended Premack and Wanous’ research. She estimated the mean correlation between voluntary turnover and RJP use was .09. Assuming a turnover rate of 50%, a department of 100 employees, and an average cost per hire of US$ 4183 (a figure based upon a published cost estimate), Phillips estimated that a firm using an RJP would have to hire 17 fewer workers per year at a savings of US$ 71,111. This cost savings estimate seems inconsistent with Phillips’ perception that “RJPs have often been discounted because of their relatively small effect sizes” (p. 687). In summary, a case can be made that the comments of McEvoy and Cascio (“managers might do well to look elsewhere”) and other researchers who have focused on “small” RJP effects may have resulted in RJPs being underutilized. Clearly, one would like to account for more than 1% of the variance in turnover. However, given that employee turnover is caused by several factors, researchers should not expect an RJP to have a great effect. Furthermore, even accounting for 1% of the variance can result in considerable cost savings. In this section, we introduced three infrequently used approaches for communicating an effect size. In our opinion, each of these offers real advantages. Abelson’s causal efficacy ratio focuses a researcher’s attention on the strength of the experimental manipulation. The common language effect size approach provides a researcher with an effect size index that should be easily grasped by readers lacking statistical sophistication. Finally, a utility analysis approach helps a researcher translate an effect size into an estimate of actual cost savings.

Interpreting/Misinterpreting Effect Sizes in Other Research Domains If research on realistic job previews were the only content area in which undervaluing a variable’s effect size may have influenced the general view of the topic, such misinterpretation would not be a major concern. However, it does not appear to be an isolated example. Consider research on personality traits. During the 1960s, a few key articles persuaded many individuals that personality measures were “poor predictors” (George, 1992: 199) given they only accounted for a small amount of variance in employee behavior (many correlations reported were in the .20 range). As a result of such articles, during the 1970s and 1980s, personality traits substantially declined in importance as an individual difference variable. Beginning around 1990, interest in personality variables went through a “rebirth” according to Hough and Schneider (1996, p. 33). Much of the credit for the renewed importance being attached to personality variables was given to a meta-analysis published by Barrick and Mount (1991). What is interesting about this rebirth of interest in personality is that the magnitude of the correlations which led to the “demise of personality variables” (Hough & Schneider, 1996: 32) is similar to that reported by Barrick and Mount. What appears to have happened is that in the 1990s researchers recognized that accounting for modest levels of variance can have important consequences (as noted by a reviewer of this article, part of

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

93

this “rebirth” may have been due to a focus on corrected correlations which are larger than correlations which have not been corrected for artifacts such as unreliability). Leadership is another content area in which interpretations of effect size appear to have had an effect (“The rise and fall of interest in and concern about leadership in organizational studies can be roughly indexed by the amount of variance explained by studies in the area”, Fichman, 1999: 296). Fichman (1999) provided an excellent treatment of how a “focus on explaining variance can have detrimental consequences for theory development” (p. 296). In addition to citing the leadership area, Fichman also provided examples from other areas (e.g., job design) to support his position that accounting for a small amount of variance does not necessarily mean that a theoretical variable is unimportant or that a theoretical model is unsound. Given that some organizational scientists conduct research on topics (e.g., stereotyping, employee selection) that have direct relevance to legal disputes, it is worth noting that courts have used effect size information to evaluate research results introduced in lawsuits. Unfortunately, it appears that in a number of cases judges and expert witnesses have relied on measures of variance accounted for and have viewed a small r2 as reflecting an unimportant relationship between variables (a recent issue of Psychology, Public Policy and Law, 1999, which focused upon sexual harassment, provides numerous examples of effect size information drawn from academic studies being discussed in a legal context). In contrast to organizational researchers, medical and public health researchers have used a greater variety of effect size indices and have not been so harsh in judging the strength of relationships. Consider two examples. Rosenthal (1990) reported that the “well-established” relationship between taking aspirin and reducing heart attacks is based upon a study in which the correlation between these two variables was .03 (the risk ratio was a more impressive figure). Lubinski and Humphreys (1997) reported that the recent health campaign aimed at persuading pregnant women not to smoke is based upon a phi coefficient of .10 between a mother’s behavior (smoker vs. non smoker) and an infant’s birth weight (normal vs. below normal). It is likely that some organizational researchers would dismiss correlations of this magnitude as being trivial.

Summary The intent of this paper was not to review all of the possible effect size measures. Nor was the intent of this paper to advocate the use of any one effect size measure. Rather our goal was to convey a sense of the multitude of factors (e.g., the intended audience, the potential costs savings of even a small effect) that should be considered in selecting effect size measures and in interpreting the magnitude of effect size estimates. To accomplish this goal, (a) we reviewed a number of frequently used effect size indices (e.g., d, r), (b) we discussed mistakes made by researchers in estimating an effect size (e.g., not controlling for the effects of other independent variables in estimating an effect size in an experiment), (c) we noted mistakes made in interpreting the size of an effect (e.g., assuming the maximum possible correlation was ±1.00), and (d) we described a number of less commonly used effect size measures (e.g., common language effect size index) which may help a researcher convey the importance of a statistical relationship to his or her audience. Given the pedagogical

94

J.A. Breaugh / Journal of Management 2003 29(1) 79–97

orientation of this paper, we did not see a need to cite specific articles in which effect sizes were inappropriately used or interpreted. In concluding this paper, there are a two remaining issues (i.e., rigidly applying guidelines for evaluating the magnitude of an effect size, focusing upon a population effect size vs. a sample-based effect size estimate) that should be briefly addressed. With regard to the first issue, in interpreting the magnitude of an effect size, researchers frequently have failed to consider the multitude of factors (e.g., a weak experimental manipulation, a skewed criterion variable) that can influence the magnitude of an effect size estimate. Instead, researchers often have almost automatically applied guidelines that have been provided for judging the magnitude of an effect size. For example, in the context of establishing guidelines for estimating statistical power, Cohen (1988) offered the following operational definitions of small, medium, and large effect sizes for r (.10, .30, .50). A careful reading of Cohen makes clear that he never intended for the rigid application of his guidelines for judging the size of a relationship (“The meaning of any given ES is, in the final analysis, a function of the context in which it is embedded”, Cohen, 1988: 535). Yet, it is quite common to see authors make statements such as “the correlation of .12 represents a small effect according to Cohen’s standards”. In terms of interpreting the magnitude of an effect size, a reviewer of this paper correctly noted that many of the examples we provided suggest that even a small effect size is likely to be important. To clear up any confusion, we emphasize that we did not mean to suggest this. For example, accounting for 1% of the variance in punctuality by means of an expensive selection test may not be a good investment (a utility analysis might demonstrate this). Alternatively, in other cases (Prentice & Miller, 1992, provided a good discussion of this issue), a small effect size can suggest an important relationship. As an example of this, Martell, Lane and Emrich (1996) provided an example of how even a small sex bias effect in performance ratings can compound over several years to have a major effect on the number of women who get promoted. With regard to the issue of focusing upon a population effect size vs. a sample-based effect size estimate, we want to make a few simple points. Given that most scientific theories concern relationships among constructs, in testing a theory, a problem a researcher faces is that “experiments involve imperfect embodiments of independent-variable and dependent-variable constructs” (Maxwell & Delaney, 1990: 96). Correlational studies also suffer from imperfect measurement of predictor and criterion constructs. Given that the measures used in a study do not perfectly represent the constructs they are designed to reflect, it is likely that an effect size will be an imprecise estimate of the magnitude of the relationship between two constructs. Given this likelihood, it has been suggested (e.g., Hunter & Schmidt, 1990) that, when a researcher is interested in a relationship between constructs, he or she should take steps prior to conducting a study to lessen the biasing effects of statistical artifacts such as sampling error, measurement error and range restriction or enhancement. 
Among these actions are (a) having a large sample, (b) using measures which have been shown to provide scores which are reliable, and (c) choosing a sample in which range restriction or range enhancement is unlikely to be a problem. Alternatively, a researcher could attempt to correct for various artifacts once a study has been completed. For example, variables measured with less than perfect reliability may be corrected for attenuation if reliability information is available. In terms of correcting for range

In terms of correcting for range restriction, this can be done if information on the unrestricted variance of a variable is available (as noted by a reviewer of this paper, an unstandardized regression coefficient is not affected by range variability). An alternative means of giving a reader a sense of the accuracy of a sample-based effect size estimate is to report a confidence interval.
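To illustrate these two strategies, the sketch below applies Thorndike's Case II formula for correcting a correlation for direct range restriction on the predictor and computes an approximate confidence interval via the Fisher r-to-z transformation. All of the numbers (the observed r, the two standard deviations, the sample size) are hypothetical, and the function names are ours.

```python
from math import atanh, sqrt, tanh

def correct_range_restriction(r_obs: float, sd_unrestricted: float,
                              sd_restricted: float) -> float:
    """Thorndike Case II correction for direct range restriction on the
    predictor, given the unrestricted and restricted standard deviations."""
    u = sd_unrestricted / sd_restricted
    return (r_obs * u) / sqrt(1 - r_obs**2 + (r_obs**2) * (u**2))

def fisher_ci(r_obs: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a correlation based on the
    Fisher r-to-z transformation (standard error = 1 / sqrt(n - 3))."""
    z = atanh(r_obs)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

# Hypothetical values: an observed r of .25 in a sample whose predictor SD (8.0)
# is smaller than the applicant-pool SD (12.0), with n = 100.
print(round(correct_range_restriction(0.25, 12.0, 8.0), 2))        # about .36
print(tuple(round(v, 2) for v in fisher_ci(0.25, 100)))            # about (.06, .43)
```

Both functions are sketches under the stated assumptions rather than a full treatment of artifact corrections (see Hunter & Schmidt, 1990).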

Although correcting for artifacts (readers interested in this topic are referred to Hunter & Schmidt, 1990, for details) and providing confidence intervals (Cortina & Nouri, 2000, addressed how to compute confidence intervals for effect size measures such as d and r) are sound strategies, they do not address a critical issue: Is one's sample representative of the population of interest? This issue has received insufficient attention in many studies. As is made clear in any basic book on sampling, ideally a researcher should begin the research process with a careful consideration of the population of interest (e.g., working spouses with dependents). Having determined the relevant population, a researcher would decide upon an appropriate sampling strategy (e.g., random sampling, stratified sampling). Such a strategy enhances the likelihood that one's results will generalize to the population of interest. In contrast, researchers frequently have used convenience samples for testing hypotheses. Although a convenience sample may be a researcher's only option, its use makes it difficult to assess the extent to which a sample-derived effect size reflects the population effect size.

In summary, reaching an informed judgment about an effect size may require considerable effort on the part of a researcher. In many cases, in order to really understand one's results, it may be necessary to compute several different effect size measures even if space constraints limit the amount of effect size information that can be presented in a paper. No matter which effect size measure(s) a researcher decides to use, he or she should carefully consider the evaluative standard being applied. As should be apparent from some of the examples given in this paper, in comparison to researchers in other disciplines, organizational researchers frequently have been quite harsh in judging the magnitude of the relationships they have documented. Such harshness not only may impede scientific progress but may also lessen the value the public attaches to our research endeavors.

In closing, we note that at numerous places in this paper we referred to certain effect size measures (e.g., a log odds ratio) as being more difficult to interpret than others (e.g., a risk ratio). Our hope is that, with the increased reporting of effect sizes, all of these measures will become more familiar and thus more easily interpretable.

Acknowledgments

The author wishes to thank Gary Burger and Michael Harris for providing insightful comments on this paper.

References

Abelson, R. P. 1995. Statistics as principled argument. Hillsdale, NJ: Erlbaum.
Anders, G. October 15, 1996. Who pays cost of cut-rate heart care? Wall Street Journal, B1, B9.
Barrick, M. R., & Mount, M. K. 1991. The big five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44: 1–26.
Bobko, P. 1995. Correlation and regression: Principles and applications for industrial/organizational psychology and management. New York: McGraw-Hill.
Carroll, J. B. 1961. The nature of the data, or how to choose a correlation coefficient. Psychometrika, 26: 347–372.
Cohen, J. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cortina, J. M., & Nouri, H. 2000. Effect sizes for ANOVA designs. Thousand Oaks, CA: Sage.
Darlington, R. B. 1968. Multiple regression in psychological research and practice. Psychological Bulletin, 69: 161–182.
Dunlap, W. P. 1994. Generalizing the common language effect size indicator to bivariate normal correlations. Psychological Bulletin, 116: 509–511.
Fichman, M. 1999. Variance explained: Why size does not (always) matter. In B. M. Staw (Ed.), Research in organizational behavior, 21: 295–331.
Fleiss, J. L. 1994. Measures of effect size for categorical data. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis: 246–259. New York: Russell Sage Foundation.
George, J. M. 1992. The role of personality in organizational life: Issues and evidence. Journal of Management, 18: 185–213.
Haddock, C. K., Rindskopf, D., & Shadish, W. R. 1998. Using odds ratios as effect sizes for meta-analysis of dichotomous data: A primer on methods and issues. Psychological Methods, 3: 339–353.
Hough, L. M., & Schneider, R. J. 1996. Personality traits, taxonomies, and applications in organizations. In K. R. Murphy (Ed.), Individual differences and behavior in organizations: 31–88. San Francisco: Jossey-Bass.
Hunter, J. E., & Schmidt, F. L. 1990. Methods of meta-analysis. Newbury Park, CA: Sage.
Judd, C. M., McClelland, G. H., & Culhane, S. E. 1995. Data analysis: Continuing issues in the everyday analysis of psychological data. Annual Review of Psychology, 46: 433–465.
Judge, T. A., Cable, D. M., Boudreau, J. W., & Bretz, R. D. 1995. An empirical investigation of the predictors of executive career success. Personnel Psychology, 48: 485–519.
Kirk, R. 1996. Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56: 746–759.
Long, J. S. 1997. Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Lubinski, D., & Humphreys, L. G. 1997. Incorporating general intelligence into epidemiology and social sciences. Intelligence, 24: 159–201.
Martell, R. F., Lane, D. M., & Emrich, C. 1996. Male–female differences: A computer simulation. American Psychologist, 51: 157–158.
Maxwell, S. E., & Delaney, H. D. 1990. Designing experiments and analyzing data. Belmont, CA: Wadsworth.
McDonald, R. P. 1999. Test theory. Mahwah, NJ: Lawrence Erlbaum Associates.
McEvoy, G. M., & Cascio, W. F. 1985. Strategies for reducing employee turnover: A meta-analysis. Journal of Applied Psychology, 70: 343–353.
McGraw, K. O., & Wong, S. P. 1992. A common language effect size statistic. Psychological Bulletin, 111: 361–365.
Nunnally, J. C. 1978. Psychometric theory. New York: McGraw-Hill.
Oakes, M. 1986. Statistical inference: A commentary for the social and behavioural sciences. Chichester, England: John Wiley and Sons.
Pampel, F. C. 2000. Logistic regression: A primer. Thousand Oaks, CA: Sage.
Pedhazur, E. J. 1997. Multiple regression in behavioral research. Fort Worth, TX: Harcourt Brace.
Phillips, J. M. 1998. Effects of realistic job previews on multiple organizational outcomes: A meta-analysis. Academy of Management Journal, 41: 673–690.
Premack, S. L., & Wanous, J. P. 1985. A meta-analysis of realistic job preview experiments. Journal of Applied Psychology, 70: 706–719.
Prentice, D. A., & Miller, D. T. 1992. When small effects are impressive. Psychological Bulletin, 112: 160–164.
Psychology, Public Policy and Law. 1999, 5.
Richardson, J. T. 1996. Measures of effect size. Behavior Research Methods, Instruments, & Computers, 28: 12–22.
Rosenthal, R. 1990. How are we doing in soft psychology? American Psychologist, 45: 775–777.
Rosenthal, R. 1994. Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis: 231–244. New York: Russell Sage Foundation.

Rosenthal, R., Rosnow, R. L., & Rubin, D. B. 2000. Contrasts and effect sizes in behavioral research. New York: Cambridge University Press.
Rudas, T. 1998. Odds ratios in the analysis of contingency tables. Thousand Oaks, CA: Sage.
Tryon, W. W. 2001. Evaluating statistical difference, equivalence, and indeterminacy using inferential confidence intervals: An integrated alternative method of conducting null hypothesis significance tests. Psychological Methods, 6: 371–386.

James A. Breaugh is a Professor of Management in the College of Business Administration at the University of Missouri, St. Louis. He received his Ph.D. from Ohio State University. His research interests include employee recruitment, worker autonomy, and applied measurement issues.
