The Power of Statistical Tests for Moderators in Meta-Analysis

Larry V. Hedges, University of Chicago
Therese D. Pigott, Loyola University of Chicago

Psychological Methods, 2004, Vol. 9, No. 4, 426-445
Copyright 2004 by the American Psychological Association. 1082-989X/04/$12.00 DOI: 10.1037/1082-989X.9.4.426

Author note: Larry V. Hedges, Department of Sociology and Department of Psychology, University of Chicago; Therese D. Pigott, School of Education, Loyola University of Chicago. Correspondence concerning this article should be addressed to Larry V. Hedges, University of Chicago, 1155 East 60th Street, Room 265A, Chicago, IL 60637.

Calculation of the statistical power of statistical tests is important in planning and interpreting the results of research studies, including meta-analyses. It is particularly important in moderator analyses in meta-analysis, which are often used as sensitivity analyses to rule out moderator effects but also may have low statistical power. This article describes how to compute statistical power of both fixed- and mixed-effects moderator tests in meta-analysis that are analogous to the analysis of variance and multiple regression analysis for effect sizes. It also shows how to compute power of tests for goodness of fit associated with these models. Examples from a published meta-analysis demonstrate that power of moderator tests and goodness-of-fit tests is not always high.

Quantitative summaries of the results of empirical research studies, or meta-analyses, are widely used in psychology, medicine, education, and the social sciences. Because important scientific and policy decisions are increasingly being informed by meta-analyses, it is important to be able to distinguish situations in which meta-analysis provides statistical tests that are likely to detect effects of the anticipated size (situations in which there is high statistical power) from those in which the tests are likely to be less sensitive. Computation of the power of statistical tests about average effects and heterogeneity of effects in meta-analysis was considered by Hedges and Pigott (2001). They discussed the rationale for computing statistical power in meta-analysis and demonstrated the correspondence between procedures for power analysis in primary research and in meta-analysis. They also showed that although the power of statistical tests for average (main) effects and homogeneity tests in meta-analysis is often high, it is not always high. Thus, analyses of statistical power have an important role to play in the planning and interpretation of meta-analyses.

Although computation of average effects and tests for heterogeneity are important in meta-analysis, many questions that arise in meta-analysis are of a somewhat different form, involving comparisons of the mean effects of groups of studies that have different characteristics, often called moderator analyses. A moderator variable is a discrete or continuous variable that "affects the direction and/or strength of the relation between an independent or predictor variable and a dependent or criterion variable" (Baron & Kenny, 1986, p. 1174).¹ In primary analysis involving discrete independent variables (e.g., a treatment), the effects of discrete moderators correspond to interactions between the moderator variable and the independent variable (this is Case I in the framework of Baron & Kenny, 1986), and tests can be formulated using the analysis of variance (ANOVA). In primary analysis involving discrete independent variables, the effects of continuous moderators also correspond to interactions between the moderator variable and the independent variable (this is Case III in the framework of Baron & Kenny, 1986), and tests can be formulated using multiple regression analysis (see West, Aiken, & Krull, 1996). Therefore, moderator effects are interactions, and this interaction is, by definition, the degree to which the relation between the independent variable and the dependent variable depends on the value of the moderator variable. In meta-analysis, the effect size represents the relation between a discrete independent variable (e.g., a treatment) and the dependent variable in a study. Therefore, in meta-analysis, a relation between effect size and a moderator variable corresponds to an interaction between the independent variable (e.g., a treatment) incorporated in the effect size and that moderator variable. Thus, in meta-analysis, moderator effects correspond to relations between the moderator variable and effect size. In meta-analysis, tests for the effects of a discrete moderator variable can be formulated using meta-analytic analogues to ANOVA to test the relation between the moderator variable and effect size (see Hedges, 1982a, 1992; Hedges & Olkin, 1985). In meta-analysis, tests for the effects of a continuous moderator variable can be formulated using meta-analytic analogues to regression analysis to test the relation between the moderator variable and effect size (see Hedges, 1982b, 1992; Hedges & Olkin, 1985).

¹ Note that it is important to distinguish moderator effects from mediator effects. A mediator variable "accounts for the relation between a predictor and the criterion" variable (Baron & Kenny, 1986, p. 1176). Analytic strategies for studying mediation in primary analysis have been described in Baron and Kenny (1986) and Kenny, Kashy, and Bolger (1998).

Hedges and Pigott (2001) did not consider computation of the statistical power of tests involving moderator variables. However, there are three reasons that it is particularly important to be able to compute the power of statistical tests involved in moderator analyses.

The first reason arises from the nature of moderator analyses as statistical tests. Moderator analyses are conceptually analyses of interactions (the interaction of treatment and a moderator variable), and tests for interactions are less powerful than tests for main effects in the same designs (see, e.g., Cronbach & Snow, 1981). Therefore, tests for moderator effects in meta-analysis are likely to be less powerful than tests for the average effect size. Thus, to assure the sensitivity of the statistical tests involved, it is even more important to be able to compute the power of tests for moderator variables in meta-analysis than it is to be able to compute power for tests of main effects.

The second reason that power analyses for moderator tests are important concerns the substantive application of moderator tests in meta-analysis. Moderator tests are often used as sensitivity analyses, asking whether effects differ across important subgroups of studies. In such situations, the conclusion of no difference is interpreted as evidence that effects do not differ across subgroups. Such an interpretation is problematic unless it is clear that the power to detect meaningful differences in effects across subgroups is high. (Such comparisons can be problematic for other reasons, but the issue of adequate statistical power is always a potentially complicating factor.)

The third reason that power analyses are needed in connection with moderator tests specifically concerns tests of goodness of fit or residual variance components. Such tests are often used as evidence of the adequacy or specification of explanatory models in meta-analysis. Here, as in the case of tests of the effects of moderators, failure to reject the null hypothesis of good fit (or, equivalently, that the residual variance component is zero) is taken as support for the hypothesis of correct (or complete) model specification. Such an interpretation is problematic unless it is clear that the test involved has high statistical power to detect important levels of heterogeneity or misspecification.

Use of Power Calculations in Meta-Analysis

Power calculations in meta-analysis, as in primary analysis, may be used either prospectively or retrospectively. Prospectively, they may be used to plan moderator analyses within a larger meta-analytic study. In planning a test for the effect of a moderator, the researcher must prespecify the size of a moderator effect that is deemed to be substantively important. The power calculation determines whether the statistical test proposed to detect the moderator effect has sufficient power. As in primary analyses, a researcher may determine that it is unwise to proceed with a test for a moderator variable that will have low power. There are two reasons for such a strategic decision. First, if the power is low, the test is unlikely to detect the moderator effect (find it statistically significant) even if it is present and large enough to be substantively meaningful. Second, if the test were carried out, a failure to detect the moderator effect would be difficult to interpret if the power was low. If power is low and a researcher decides to proceed with moderator analyses anyway, power calculations can provide a quantification of the (lack of) sensitivity of the tests involved and thus help circumscribe strong interpretations of statistically insignificant results.

Retrospectively, power analyses may be used to evaluate moderator analyses that have already been conducted, by providing information about how sensitive the statistical tests were to substantively meaningful effects. If tests for moderator effects are found to have low power, statistically nonsignificant effects of moderators do not provide strong evidence for ruling out moderator effects. Alternatively, if a test for a moderator is found to have very high statistical power to detect the smallest substantively meaningful effect, then failure to detect that effect is evidence that moderator effects are not likely to be large enough to be substantively meaningful.

In the retrospective application of power analysis, as in the prospective one, the researcher must prespecify the size of a moderator effect that is deemed to be substantively important. That is, the moderator effect size must be determined a priori. In particular, the observed effect of the moderator variable should never be used to compute statistical power. As critics of retrospective power analysis have noted, "Calculating power using observed sizes is not helpful because such values are very poor estimates of the actual power given the population effect size and do not take into account the biological [substantive] significance of the effect size value used" (Thomas, 1997, p. 278). However, even critics of retrospective power analysis argue that "calculating power using pre-specified effect sizes . . . is helpful" (Thomas, 1997, p. 278).

Statistical Models for Meta-Analysis

Meta-analysis was defined as "the statistical analysis of a large collection of analysis results from individual studies for the purposes of integrating findings" (Glass, 1976, p. 3). The first meta-analyses labeled as such represented the results of statistical analyses via effect sizes that were standardized mean differences or z-transformed correlation coefficients. However, several other effect size measures are now also used in meta-analysis (see, e.g., Fleiss, 1994; Lipsey & Wilson, 2001; Rosenthal, 1994). Although early statistical work on meta-analysis often derived statistical procedures for a single effect size measure, statisticians have come to recognize that most of these statistical methods are based on the same large sample model (see, e.g., Cooper & Hedges, 1994, pp. 36-37; Hedges, 1992). This model implies that the sampling distribution of the effect size estimate T is normally distributed about its corresponding effect size parameter θ with a variance that may be treated as known, either because it is computed analytically or because it can be estimated very accurately from available information such as the sample size and the effect size estimate.

How well this model describes the effect size depends on the particular effect size index and on how well the assumptions of the primary statistical analysis are met. If the assumptions of the primary statistical analysis are met (i.e., if the within-study data have the multivariate normal distribution), the z-transformed correlation coefficient is normally distributed about the z-transformed correlation parameter with variance 1/(n − 3). Thus the model used in this article is quite accurate when the z-transformed correlation is the effect size. If the assumptions of the t test are met within a study, the standardized mean difference in balanced designs is approximately normally distributed about the standardized mean difference parameter with variance

$$\frac{2}{n}\left(1 + \frac{\theta^2}{8}\right),$$

where n is the within-treatment-group sample size. In this case, the variance depends on the true standardized mean difference θ, and the estimate T must be substituted for θ to obtain an estimate of the variance. However, as θ is typically less than 1 and the second term has only one eighth the weight of the first, the substitution of T for θ has only a modest impact on the variance. In fact, as the second term is typically only a few percent of the first term, omitting the second term of the variance altogether would have only a modest impact in many applications. There is considerable evidence that the model of known variance works reasonably well for the standardized mean difference when the within-study (treatment and control group) sample sizes are not too small.

Other effect sizes that are less commonly used in psychology include the log odds ratio, the log risk ratio, and the difference between proportions (see Fleiss, 1994). The variance of these effect size measures also depends on nuisance parameters (and on the effect size in a less transparent way). But as long as none of the within-study sample sizes are too small, the normal approximation to the distribution of these effect sizes also appears to work well.

However, less is known about how well these approximations work in situations in which the assumptions of the statistical models used in primary analysis are violated. There is at least some evidence that standard meta-analytic tests, like their analogues in primary analysis, are affected by violations of assumptions (see, e.g., Harwell, 1997). It is possible that power calculations (which involve noncentral distributions of test statistics) may be more sensitive to these violations than are significance tests (which involve only central distributions of test statistics).

An additional assumption is involved in statistical tests used in random- and mixed-effects models for meta-analysis: that the study-specific random effects are normally distributed. Because the random effects are not observed directly, this assumption is difficult to check. However, there is at least some evidence that, as long as the variance of the random effects is small compared with the variance of the sampling errors of the individual effect size estimates, even rather profound nonnormality of the random effects may have little effect on tests for mean effect sizes in meta-analysis (see Hedges & Vevea, 1996).

In this article we provide methods for calculating statistical power for both fixed-effects and mixed-effects meta-analytic procedures that are analogues to ANOVA. We include procedures to compute power for omnibus tests for differences among two or more group mean effect sizes and for a priori (planned) and post hoc comparisons among group mean effect sizes. We also provide methods for calculating the statistical power of both fixed-effects and mixed-effects meta-analytic procedures that are analogues to multiple regression analysis. We consider both simultaneous tests for blocks of regression coefficients and tests for individual regression coefficients. Finally, we consider tests of goodness of fit of both fixed-effects and mixed-effects statistical procedures. We illustrate the calculations with examples from a published meta-analysis.
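Before turning to those procedures, the known-variance model can be made concrete with a small computation. The sketch below (ours, in Python, not from the article) evaluates the balanced-design variance formula for the standardized mean difference given above; the sample size and effect size used here are illustrative choices of ours.

```python
# Sketch (ours): large-sample variance of a standardized mean difference
# in a balanced two-group design, from the formula above; the estimate T
# substitutes for the parameter theta.
def smd_variance(n, t):
    """n: per-group sample size; t: standardized mean difference estimate."""
    return (2.0 / n) * (1.0 + t ** 2 / 8.0)

# Illustrative values (ours): n = 25 per group and T = 0.45 give a
# variance of about 0.08, the typical value used in later examples.
print(smd_variance(25, 0.45))  # ~0.082
```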

ANOVA Analogues

We describe first analogues to fixed-effects ANOVA and then the corresponding procedures that are analogues to mixed-effects ANOVA. In both of these models, the study grouping variable is taken to be a fixed effect. That is, each value of the study grouping variable (e.g., the moderator variable) defines a group of studies. As in ANOVA in primary research, because each group may have a different mean effect size, each group is a separate population of studies. In fixed-effects models, the inferences are to a population of studies with the same effect size parameters as the observed studies. In mixed models, the inferences are to a population of studies from which the observed studies are a sample. In particular, each observed group of studies is taken to be a sample from a population of studies with the same value of the grouping (moderator) variable. In mixed models, between-studies-within-groups variation in effect size parameters is taken to be the result of sampling of studies from the population of studies with the same value of the grouping variable.

The principal difference between fixed-effects and mixed-effects statistical procedures is in the calculation of the variance of the group mean effect sizes. In fixed-effects procedures, between-studies-within-groups variation does not influence the uncertainty of group mean effect sizes. In mixed-model procedures, such variation does contribute to the uncertainty of group mean effect sizes.

Fixed-Effects Procedures

Suppose that there are p groups of studies and there are m_i studies in the ith group, and let k = m_1 + m_2 + . . . + m_p be the total number of studies. Denote the effect size estimate in the jth study in the ith group by T_ij and the corresponding effect size parameter by θ_ij. Let the variance (square of the standard error) of T_ij be v_ij, and assume that T_ij is normally distributed about θ_ij, that is,

$$T_{ij} \sim N(\theta_{ij}, v_{ij}). \quad (1)$$

Alternatively, we might say that T_ij = θ_ij + ε_ij, where ε_ij ≡ T_ij − θ_ij and

$$\varepsilon_{ij} \sim N(0, v_{ij}).$$

This distributional assumption is the large sample approximation typically used to develop the theory of statistical tests used in meta-analysis. The actual formula for the variance v_ij depends on the effect size index chosen (see, e.g., Hedges & Olkin, 1985; Lipsey & Wilson, 2001). The weighted mean effect size for the ith group is T̄_i•, given by

$$\bar{T}_{i\bullet} = \frac{\sum_{j=1}^{m_i} w_{ij} T_{ij}}{\sum_{j=1}^{m_i} w_{ij}},$$

where w_ij = 1/v_ij, which has a sampling distribution given by T̄_i• ~ N(θ̄_i•, v_i•), where θ̄_i• is given by

$$\bar{\theta}_{i\bullet} = \frac{\sum_{j=1}^{m_i} w_{ij} \theta_{ij}}{\sum_{j=1}^{m_i} w_{ij}} \quad (2)$$

and v_i• is given by

$$v_{i\bullet}^{-1} = \sum_{j=1}^{m_i} w_{ij}. \quad (3)$$

The weighted grand mean T̄_•• is given by

$$\bar{T}_{\bullet\bullet} = \frac{\sum_{i=1}^{p} w_{i\bullet} \bar{T}_{i\bullet}}{\sum_{i=1}^{p} w_{i\bullet}},$$

where w_i• = 1/v_i•, and the corresponding population parameter is

$$\bar{\theta}_{\bullet\bullet} = \frac{\sum_{i=1}^{p} w_{i\bullet} \bar{\theta}_{i\bullet}}{\sum_{i=1}^{p} w_{i\bullet}}. \quad (4)$$

Omnibus Tests of Between-Groups Differences

The omnibus test of the null hypothesis that the group mean effect sizes are equal,

$$H_0: \bar{\theta}_{1\bullet} = \bar{\theta}_{2\bullet} = \cdots = \bar{\theta}_{p\bullet} \quad (5)$$

(analogous to the F test in ANOVA), uses the test statistic

$$Q_B = \sum_{i=1}^{p} w_{i\bullet} (\bar{T}_{i\bullet} - \bar{T}_{\bullet\bullet})^2, \quad (6)$$

which has the chi-square distribution with (p − 1) degrees of freedom when the null hypothesis is true. Therefore, the omnibus test at significance level α rejects the null hypothesis when Q_B > c_α, where c_α is the 100(1 − α) percentile point of the (central) chi-square distribution with (p − 1) degrees of freedom. When the null hypothesis is false, that is, when some of the group mean effect sizes differ from one another, Q_B has a noncentral chi-square distribution with (p − 1) degrees of freedom and noncentrality parameter λ_B given by

$$\lambda_B = \sum_{i=1}^{p} w_{i\bullet} (\bar{\theta}_{i\bullet} - \bar{\theta}_{\bullet\bullet})^2. \quad (7)$$

The power of the test based on Q_B at significance level α is therefore

$$1 - H(c_\alpha \mid p - 1; \lambda_B), \quad (8)$$

where H(x|ν; λ) is the cumulative distribution function of the noncentral chi-square with ν degrees of freedom and noncentrality parameter λ. This distribution is tabulated and is widely available in statistical software such as SAS, SPSS, SPlus, and STATA.

Choosing values for the noncentrality parameter. Using Equation 8 to compute the statistical power of the test of between-groups differences in effect size requires a value of λ_B. If the studies have already been collected and data extracted from the studies, then the number of studies in each group and the sampling variances of each group mean effect size (which are derived from the sampling variances of the individual effect sizes) will be known. The only other elements required to compute λ_B are the θ̄_i• values. In some circumstances it may be possible to guess the θ̄_i• values directly, so that Equation 7 can be used to obtain λ_B. In other cases, particularly in the preliminary stages of a meta-analysis, it may be difficult to guess the θ̄_i• values of interest (and even the v_i• values), so approaches to computing λ_B that do not require guessing all of the θ̄_i• values may be easier to use. We give two such approaches.

If p = 2, so that there are only two groups, then λ_B reduces to

$$\lambda_B = \frac{(\bar{\theta}_{1\bullet} - \bar{\theta}_{2\bullet})^2}{v_{1\bullet} + v_{2\bullet}}, \quad (9)$$

so that values of λ_B may be computed from the smallest difference among group mean effect sizes that is of substantive importance.

If the variances v_ij of the individual effect size estimates are identical to v or nearly so, and if all groups have the same number m of effect sizes, then w_i• = m/v. In this case one might treat λ_B as m(p − 1)/v times the "variance" of the θ̄_i• values. Thus λ_B is m(p − 1) times the ratio of the between-groups-of-studies "variance" to the within-study variance. The size of a potential noncentrality parameter might therefore be guessed using a guess of this ratio.

Example of the fixed-effects omnibus test. The data used in all examples come from a meta-analysis of the effects of phonics instruction on children's development of reading skills commissioned by the National Reading Panel (Ehri, Nunes, Stahl, & Willows, 2001). The meta-analysis identified 38 experiments comparing treatment and control conditions, resulting in 66 effect sizes. The article also examined a number of moderator variables using a series of single-variable categorical models. One of these factors was socioeconomic status (SES). The research literature on the effectiveness of phonics instruction has found varying effects for children from low- to middle-income families. Table 1 provides the effect sizes and summary statistics from studies with children from low-income families, studies with children from middle-income families, and studies with children from mixed (both low and middle) SES backgrounds.

Table 1
Systematic Phonics Instruction Effects for Children From Different Income Groups

Low income (6 studies)
  N:                            159, 105, 112, 38, 31, 29
  Effect size:                  0.72, 0.63, 0.73, 0.07, 1.19, 0.47
  Effect size variance (v_ij):  0.03, 0.04, 0.04, 0.11, 0.15, 0.14
  Summary: d̄_low = 0.66, v_low = 0.009, 95% CI [0.47, 0.85], Q_W,low = 5.56 (ns); d*_low = 0.64, v*_low = 0.025

Middle income (10 studies)
  N:                            67, 85, 24, 56, 24, 24, 118, 36, 112, 80
  Effect size:                  0.27, −0.11, 0.00, 0.53, 0.14, −0.07, 0.39, 0.16, 0.53, 2.27
  Effect size variance (v_ij):  0.06, 0.05, 0.17, 0.07, 0.17, 0.17, 0.03, 0.11, 0.04, 0.08
  Summary: d̄_mid = 0.45, v_mid = 0.007, 95% CI [0.28, 0.61], Q_W,mid = 52.70 (p < .05); d*_mid = 0.45, v*_mid = 0.016

Mixed income (14 studies)
  N:                            144, 276, 320, 247, 68, 70, 35, 57, 112, 37, 32, 40, 42, 49
  Effect size:                  0.51, 0.25, 0.38, 0.60, 0.91, 0.36, 0.12, 0.03, 0.20, 0.60, 0.21, 0.24, 0.50, 0.76
  Effect size variance (v_ij):  0.03, 0.01, 0.01, 0.02, 0.06, 0.06, 0.11, 0.07, 0.04, 0.11, 0.13, 0.10, 0.10, 0.09
  Summary: d̄_mixed = 0.39, v_mixed = 0.002, 95% CI [0.29, 0.48], Q_W,mixed = 15.11 (ns); d*_mixed = 0.41, v*_mixed = 0.010

Note. These data are from Ehri, Nunes, Stahl, and Willows (2001). SES = socioeconomic status; CI = confidence interval. Starred quantities (d*, v*) are the mixed-model (random-effects) group means and their variances.

To illustrate the omnibus test of between-groups differences, we first restrict our attention to studies that include either low-income or middle-income children. We can use the results in Table 1 to compute the power of the α = .05 level test when we are interested in detecting a difference between the means equal to 0.25, or one fourth of a standard deviation. Because in this case p = 2, we can use Equation 9 to compute the noncentrality parameter. Using Table 1, we find that the sum of the variances of the weighted group mean effect sizes is 0.009 + 0.007 = 0.016. Thus, the noncentrality parameter is λ_B = (0.25 × 0.25)/0.016 = 3.906. The 95% critical value for the central chi-square distribution with 1 degree of freedom is 3.841. The value of the cumulative distribution of the noncentral chi-square at the critical value 3.841, with 1 degree of freedom and a noncentrality parameter of 3.906, is 0.49. The power to detect a 0.25 difference between the low-income and middle-income means is therefore 1 − 0.49 = 0.51.

The test statistic Q_B for testing the difference between the effect sizes for studies with low- and middle-income students is Q_B = 2.76, which is not significant at the α = .05 significance level. The relatively low power of the test suggests that this test does not provide persuasive evidence about whether SES moderates effects. If this power calculation had been conducted as part of planning the analyses, the test might not have been carried out because the power was so low.

To illustrate computing power with three groups (that is, p = 3), suppose that we believe that the difference between the means of the low-SES and middle-SES students is 0.25 standard deviations (as above), but now suppose that we hypothesize that the mean of the mixed-SES group is halfway in between (i.e., 0.125 standard deviations different from either of the other two). Because λ_B is invariant under translations of the θ̄_i•, we can simplify the calculation of λ_B by setting the mean for the middle-SES group to 0. Using these values (viz., 0.250, 0.125, and 0.000 for the low-, mixed-, and middle-SES groups, respectively) for the group mean effect sizes and the variances of the group mean effect sizes from Table 1, we find that the grand mean (the weighted average of the θ̄_i•) is

θ̄_•• = (0.250/0.009 + 0.125/0.002 + 0.000/0.007)/(1/0.009 + 1/0.002 + 1/0.007) = 0.120

and the noncentrality parameter is

λ_B = (0.250 − 0.120)²/0.009 + (0.125 − 0.120)²/0.002 + (0.000 − 0.120)²/0.007 = 3.947.

The 95% critical value of the central chi-square distribution with (3 − 1) = 2 degrees of freedom is 5.991. The value of the cumulative distribution of the noncentral chi-square at 5.991, with 2 degrees of freedom and noncentrality parameter 3.947, is 0.59. Thus, the power of the α = .05 level test to detect this pattern of mean differences is 1 − 0.59 = 0.41. Note that this is somewhat lower than the power to detect the difference between the effect sizes of just the middle- and low-SES groups.
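These noncentral chi-square computations are easy to script. As one concrete illustration (ours, not the authors'; it assumes NumPy and SciPy, and the function name and data layout are ours), the following sketch implements Equations 7 and 8 and reproduces the two power values just computed.

```python
# Sketch (ours, using SciPy) of the fixed-effects omnibus power
# calculation in Equations 7 and 8.
import numpy as np
from scipy.stats import chi2, ncx2

def omnibus_power(group_means, group_mean_variances, alpha=0.05):
    """Power of Q_B given hypothesized group means (theta-bar_i.)
    and the variances (v_i.) of the group mean effect sizes."""
    theta = np.asarray(group_means)
    w = 1.0 / np.asarray(group_mean_variances)   # w_i. = 1 / v_i.
    grand = np.sum(w * theta) / np.sum(w)        # weighted grand mean
    lam = np.sum(w * (theta - grand) ** 2)       # noncentrality, Eq. 7
    df = len(theta) - 1
    crit = chi2.ppf(1 - alpha, df)               # central critical value
    return ncx2.sf(crit, df, lam)                # 1 - H(c_a | df; lam), Eq. 8

# Two-group example (low vs. middle income): ~0.51
print(omnibus_power([0.25, 0.00], [0.009, 0.007]))
# Three-group example (low, mixed, middle): ~0.41
print(omnibus_power([0.25, 0.125, 0.00], [0.009, 0.002, 0.007]))
```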

Contrasts Among Group Mean Effect Sizes

Just as contrasts are used to explore differences among group means in ANOVA, contrasts can be used to explore group mean effect sizes in meta-analysis. A contrast parameter is a linear combination of group mean effect sizes of the form

$$\gamma = c_1 \bar{\theta}_{1\bullet} + c_2 \bar{\theta}_{2\bullet} + \cdots + c_p \bar{\theta}_{p\bullet}, \quad (10)$$

where the coefficients c_1, c_2, . . . , c_p are known coefficients that satisfy the constraint c_1 + c_2 + . . . + c_p = 0 and are chosen to reflect a particular comparison or pattern of interest. For example, if p = 4, the coefficients c_1 = 1, c_2 = 0, c_3 = 0, c_4 = −1 might be chosen to compare the mean effect sizes of Groups 1 and 4.

The contrast parameter given in Equation 10 is usually estimated by the sample contrast

$$G = c_1 \bar{T}_{1\bullet} + c_2 \bar{T}_{2\bullet} + \cdots + c_p \bar{T}_{p\bullet}, \quad (11)$$

which has a normal sampling distribution with mean γ and variance

$$v_G = c_1^2 v_{1\bullet} + c_2^2 v_{2\bullet} + \cdots + c_p^2 v_{p\bullet}. \quad (12)$$

Because G has a normal distribution with known variance, statistical tests of the hypothesis that γ = 0 can be carried out using the test statistic Z_G = G/√v_G, which has the standard normal distribution when the null hypothesis is true. When the null hypothesis is false, Z_G has a normal distribution with mean γ/√v_G and variance 1.

Planned comparisons. The one-sided test for a planned comparison at significance level α is carried out by rejecting the null hypothesis that γ = 0 if Z_G > c_α, where c_α is the 100(1 − α) percent point of the standard normal distribution (e.g., c_α = 1.645 for α = .05). Because the one-tailed test that γ = 0 at level α rejects when Z_G > c_α, the power of the one-tailed test is given by

$$1 - \Phi(c_\alpha - \gamma/\sqrt{v_G}), \quad (13)$$

where Φ(x) is the standard normal cumulative distribution function. Computation of power of the two-sided test is slightly more complicated. Because the two-sided test of the hypothesis that γ = 0 at level α rejects when |Z_G| > c_{α/2}, that is, if Z_G > c_{α/2} or if Z_G < −c_{α/2}, the power of the two-sided test is given by

$$1 - \Phi(c_{\alpha/2} - \gamma/\sqrt{v_G}) + \Phi(-c_{\alpha/2} - \gamma/\sqrt{v_G}). \quad (14)$$

Unplanned or post hoc comparisons. Two procedures for carrying out two-tailed unplanned (post hoc) comparisons were described by Hedges and Olkin (1985). One procedure is a generalization of the Bonferroni procedure to obtain simultaneous tests of l comparisons in which the familywise Type I error rate of all l tests is controlled to be less than α. In this procedure the significance level α used to obtain the critical value is replaced by α/2l. Thus the power of the two-tailed test is computed exactly as in Equation 14, except that c_{α/2} is replaced by c_{α/2l}. For example, for l = 5 and α = .05, c_{0.05/2} = c_{0.025} = 1.96 is replaced with c_{0.05/10} = c_{0.005} = 2.58.

The second procedure is a generalization of the Scheffé procedure. In this procedure the familywise Type I error rate of any number of contrasts is controlled to be less than α. In this procedure the critical value c_{α/2} is replaced by √C(p − 1; α), where C(p − 1; α) is the 100(1 − α) percentile point of the chi-square distribution with p − 1 degrees of freedom. Thus, the power of the two-tailed test is computed exactly as in Equation 14, except that c_{α/2} is replaced by √C(p − 1; α). For example, for p = 5 and α = .05, c_{0.05/2} = 1.96 is replaced with √C(5 − 1; α) = √9.488 = 3.08.

Example of planned and post hoc comparisons. Returning to the phonics instruction data in Table 1, we consider the planned contrast for the comparison of studies with low-income children versus the average of the studies with middle- and mixed-income children, tested at the α = .05 level of significance. We can pose a value of the contrast parameter of γ = 0.25, with variance equal to

v_G = (1)²(v_low•) + (1/2)²(v_mid•) + (1/2)²(v_mixed•) = 0.009 + 0.007/4 + 0.002/4 = 0.011.

For the one-tailed test, the 95% critical value of the standard normal distribution is 1.645. The value of the standard normal cumulative distribution function at c_α − γ/√v_G = 1.645 − 0.25/√0.011 = −0.739 is 0.23 when α = .05. Thus, the power for the one-tailed test of the planned comparison is 1 − 0.23 = 0.77.

For a two-sided test of the planned comparison, we need two values of the standard normal cumulative distribution. The 95% critical values for a two-tailed test using the standard normal distribution are 1.96 and −1.96. We need the value of the standard normal cumulative distribution at 1.96 − 0.25/√0.011 = −0.424 and at −1.96 − 0.25/√0.011 = −4.344. The power of the two-sided test is 1 − Φ(−0.424) + Φ(−4.344) = 1 − 0.34 + 0.00 = 0.66. The value of the contrast computed using the mean effect sizes in Table 1 is 0.24, so the value of the contrast parameter used in this example is consistent with estimates from the data in Table 1. Note that the power is somewhat higher than that of the omnibus test but still lower than the 80% recommended by Cohen (1977).

We can compute the power of post hoc comparisons using either the Bonferroni or the Scheffé procedure. Let us assume we wish to compare the effect sizes from the three groups of studies and we have two statistical comparisons in mind, making l = 2. The simultaneous test via the Bonferroni method at level α = .05 uses critical values for the standard normal at the α/(2l) = .05/4 = .0125 level, giving critical values of −2.24 and 2.24. If we are interested in detecting a comparison with a value of 0.25 and the same variance we computed above, we find c_{α/2l} − γ/√v_G = 2.24 − 0.25/√0.011 = −0.144 and −c_{α/2l} − γ/√v_G = −2.24 − 0.25/√0.011 = −4.624. The power of the two-sided test is 1 − Φ(−0.144) + Φ(−4.624) = 1 − 0.44 + 0.00 = 0.56.

The Scheffé procedure provides the power for any number of contrasts. Let us assume that one of the contrasts we will compute has a value of 0.25 and variance of 0.011 as in our prior examples. In this example, we have p = 3, so we need the critical value of the chi-square distribution at α = .05, or C(3 − 1; 0.05). The critical value of the chi-square distribution with 2 degrees of freedom and α = .05 is 5.991. We replace c_{α/2} with √C(2; 0.05) = √5.991 = 2.448 in Equation 14. We obtain the power of the two-sided test by calculating √C(2; 0.05) − γ/√v_G = 2.448 − 0.25/√0.011 = 0.064 and −√C(2; 0.05) − γ/√v_G = −2.448 − 0.25/√0.011 = −4.832. Thus, the power is 1 − Φ(0.064) + Φ(−4.832) = 1 − 0.53 + 0.00 = 0.47. Although the power of the Scheffé procedure is lower than that of the Bonferroni procedure in this example, it need not be, particularly when the number of contrasts l is large.
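A small script can reproduce the four power values in this example. The following sketch (ours, assuming SciPy; the function name and the `crit` parameter are our conveniences) implements Equations 13 and 14, with optional Bonferroni and Scheffé critical values.

```python
# Sketch (ours, using SciPy) of contrast power, Equations 13 and 14;
# `crit` lets the caller substitute a Bonferroni or Scheffe critical value.
from math import sqrt
from scipy.stats import chi2, norm

def contrast_power(gamma, v_g, alpha=0.05, two_sided=True, crit=None):
    shift = gamma / sqrt(v_g)                   # gamma / sqrt(v_G)
    if not two_sided:
        c = norm.ppf(1 - alpha) if crit is None else crit
        return 1 - norm.cdf(c - shift)          # Eq. 13
    c = norm.ppf(1 - alpha / 2) if crit is None else crit
    return 1 - norm.cdf(c - shift) + norm.cdf(-c - shift)  # Eq. 14

v_g = 0.009 + 0.007 / 4 + 0.002 / 4                # ~0.011
print(contrast_power(0.25, v_g, two_sided=False))  # planned, one-tailed: ~0.77
print(contrast_power(0.25, v_g))                   # planned, two-sided: ~0.66
ell = 2                                            # two post hoc comparisons
print(contrast_power(0.25, v_g, crit=norm.ppf(1 - 0.05 / (2 * ell))))  # Bonferroni: ~0.56
print(contrast_power(0.25, v_g, crit=sqrt(chi2.ppf(0.95, 3 - 1))))     # Scheffe: ~0.47
```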

Tests of Within-Group Heterogeneity

The test of within-group homogeneity of effect sizes is sometimes used as a test of goodness of fit of the fixed-effects model. Formally, this is a test of the hypothesis

$$H_0: \theta_{ij} = \bar{\theta}_{i\bullet}, \quad i = 1, \ldots, p \quad (15)$$

versus the alternative that at least some of the θ_ij differ from θ̄_i•. The test uses the test statistic

$$Q_E = \sum_{i=1}^{p} \sum_{j=1}^{m_i} w_{ij} (T_{ij} - \bar{T}_{i\bullet})^2, \quad (16)$$

which has the chi-square distribution with (k − p) degrees of freedom when the hypothesis of homogeneity of effect sizes within groups is true. The test at significance level α rejects the null hypothesis when Q_E > c_α, where c_α is the 100(1 − α) percentile point of the (central) chi-square distribution with (k − p) degrees of freedom. When the null hypothesis is false, that is, when some of the effect size parameters within groups differ from one another, Q_E has a noncentral chi-square distribution with (k − p) degrees of freedom and noncentrality parameter λ_E given by

$$\lambda_E = \sum_{i=1}^{p} \sum_{j=1}^{m_i} w_{ij} (\theta_{ij} - \bar{\theta}_{i\bullet})^2. \quad (17)$$

The power of the test based on Q_E at significance level α is therefore

$$1 - H(c_\alpha \mid k - p; \lambda_E), \quad (18)$$

where H(x|ν; λ) is the cumulative distribution function of the noncentral chi-square with ν degrees of freedom and noncentrality parameter λ.

Choosing values for the noncentrality parameter. The statistical power of the test of within-group homogeneity of effect size requires the value of λ_E, the noncentrality parameter of substantive interest. This depends on the number of studies in each group, the sampling (conditional) variance of each study, and the effect size parameters of each study, which may be difficult to guess in the preliminary stages of a meta-analytic study. If the variances v_ij of the individual effect size estimates are identical to v or nearly so, then one might treat the inner sum in the expression for λ_E as (m_i − 1)/v times the "variance" of the θ_ij values in group i, and thus λ_E is (k − p) times the ratio of the between-studies "variance" to the within-study variance. Using empirical findings from Schmidt (1992) about the ratio of between-studies to within-studies variance, Hedges and Pigott (2001) suggested a convention that corresponds to assuming that λ_E = 0.33(k − p) is a small degree of heterogeneity, λ_E = 0.67(k − p) is a moderate degree of heterogeneity, and λ_E = (k − p) is a large degree of heterogeneity.

Example of a test of within-group heterogeneity. Returning to the phonics instruction data and our three-group example (comparing the mean effect sizes for studies with low-income, middle-income, and mixed-income students), we have k, the total number of studies, equal to 30; p, the number of groups, equal to 3; and a significance level of α = .05. Assuming a small degree of heterogeneity, we obtain λ_E = 0.33(30 − 3) = 0.33(27) = 8.91. The 95% critical value for the central chi-square distribution with 27 degrees of freedom is 40.11. The value of the cumulative distribution of the noncentral chi-square at the critical value of 40.11, with 27 degrees of freedom and noncentrality parameter 8.91, is 0.70. The power of the test to detect a small degree of heterogeneity is then 1 − 0.70 = 0.30. With a moderate degree of heterogeneity, the noncentrality parameter is λ_E = 0.67(27) = 18.09. The power of the test to detect a moderate degree of heterogeneity is 1 − H(40.11|27; 18.09) = 1 − 0.35 = 0.65. A large degree of heterogeneity results in a noncentrality parameter equal to k − p = 27. The power is then 1 − H(40.11|27; 27) = 1 − 0.13 = 0.87. This example illustrates that heterogeneity tests do not necessarily have high power to detect small or moderate degrees of heterogeneity.
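The heterogeneity-test power computation is again a single noncentral chi-square tail probability. The sketch below (ours, assuming SciPy) reproduces the three power values in this example from the conventions for λ_E.

```python
# Sketch (ours, using SciPy): power of the within-group homogeneity test
# Q_E (Eq. 18) under the small/moderate/large heterogeneity conventions.
from scipy.stats import chi2, ncx2

k, p, alpha = 30, 3, 0.05
crit = chi2.ppf(1 - alpha, k - p)                      # 40.11
for label, ratio in [("small", 0.33), ("moderate", 0.67), ("large", 1.0)]:
    lam = ratio * (k - p)                              # lambda_E convention
    print(label, round(ncx2.sf(crit, k - p, lam), 2))  # 0.30, 0.65, 0.87
```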

Mixed-Effects Procedures

Under the fixed-effects model, the θ_ij were fixed, but unknown, constants. Under this assumption, the variance of T_ij is simply v_ij. In the mixed model, the θ_ij are not fixed but sampled from a universe of θ_ij values (corresponding to a universe of studies in the ith group of studies). Therefore, it is necessary to distinguish between the variance of T_ij assuming fixed θ_ij and the variance of T_ij that incorporates the variance of θ_ij as well. The former is the conditional sampling variance of T_ij, and the latter is the unconditional sampling variance of T_ij.

It is conventional to decompose the observed effect size estimate into fixed and random components,

$$T_{ij} = \theta_{ij} + \varepsilon_{ij} = \mu_i + \xi_{ij} + \varepsilon_{ij}, \quad (19)$$

where ε_ij is the sampling error of T_ij as an estimate of θ_ij and θ_ij itself can be decomposed into the mean μ_i of the population of effects from which the ith group of effects is sampled and the error ξ_ij of θ_ij as an estimate of μ_i. In this decomposition only μ_i is fixed, and we assume that ξ_ij and ε_ij are random and independently normally distributed with mean 0. The variance of ε_ij is v_ij, the conditional sampling error variance of T_ij, which is known. The variance of the population from which the ξ_ij are sampled is τ², assumed to be the same across groups. Equivalently, we might say that τ² is the within-groups (of studies) variance of the population effect size parameters, which is why τ² is often called the between-studies-within-groups variance component. Because the effect size θ_ij is a value obtained by sampling from a distribution of potential values, the unconditional sampling variance of T_ij involves τ². In particular, the unconditional sampling variance of T_ij is

$$v_{ij}^* = v_{ij} + \tau^2. \quad (20)$$

Methods of estimation in mixed-effects statistical procedures for meta-analysis typically involve estimating τ² and then computing weights that are the reciprocals of v*_ij. The methods usually used to estimate τ² are method of moments estimators analogous to those used to estimate variance components in ANOVA (see, e.g., DerSimonian & Laird, 1986; Hedges, 1992; Raudenbush, 1994; Schmidt & Hunter, 1977). Estimation of τ² uses the same principles as the estimation of variance components in ANOVA. One estimate of τ² is

$$\hat{\tau}^2 = \begin{cases} [Q_E - (k - p)]/a & \text{if } Q_E \geq (k - p) \\ 0 & \text{if } Q_E < (k - p) \end{cases}, \quad (21)$$

where a = a_1 + a_2 + . . . + a_p, and a_i is given by

$$a_i = \sum_{j=1}^{m_i} w_{ij} - \frac{\sum_{j=1}^{m_i} w_{ij}^2}{\sum_{j=1}^{m_i} w_{ij}}, \quad (22)$$

where the w_ij = 1/v_ij are the weights used in the fixed-effects analysis and Q_E is the statistic given in Equation 16 in connection with testing the within-group homogeneity of effect sizes.

The logic of the weighting in mixed-model analyses is the same as in fixed-effects procedures, but here the weights are the reciprocals of the unconditional variances (the v*_ij's) rather than the conditional variances (the v_ij's) of the effect size estimates. Because we seldom know the exact value of τ² needed to compute v*_ij, we usually construct the weights by substituting the estimate τ̂² for τ², to get

$$w_{ij}^* = 1/[\hat{\tau}^2 + v_{ij}].$$

The weighted mean effect size for the ith group is T̄*_i•, given by

$$\bar{T}_{i\bullet}^* = \frac{\sum_{j=1}^{m_i} w_{ij}^* T_{ij}}{\sum_{j=1}^{m_i} w_{ij}^*}. \quad (23)$$

The weighted mean T̄*_i• has sampling distribution given by T̄*_i• ~ N(μ_i, v*_i•), where v*_i• is given by

$$(v_{i\bullet}^*)^{-1} = \sum_{j=1}^{m_i} w_{ij}^*. \quad (24)$$

The weighted grand mean T̄*_•• is given by

$$\bar{T}_{\bullet\bullet}^* = \frac{\sum_{i=1}^{p} w_{i\bullet}^* \bar{T}_{i\bullet}^*}{\sum_{i=1}^{p} w_{i\bullet}^*},$$

where w*_i• = 1/v*_i•.

Omnibus Tests for Between-Groups Differences

The omnibus test of the null hypothesis that the group mean effect sizes are equal,

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_p \quad (25)$$

(analogous to the F test in ANOVA), uses the test statistic

$$Q_B^* = \sum_{i=1}^{p} w_{i\bullet}^* (\bar{T}_{i\bullet}^* - \bar{T}_{\bullet\bullet}^*)^2, \quad (26)$$

which has the chi-square distribution with (p − 1) degrees of freedom when the null hypothesis is true. Therefore, the omnibus test at significance level α rejects the null hypothesis when Q*_B > c_α, where c_α is the 100(1 − α) percentile point of the (central) chi-square distribution with (p − 1) degrees of freedom.

When the null hypothesis is false, that is, when some of the group mean effect sizes differ from one another, Q*_B has a noncentral chi-square distribution with (p − 1) degrees of freedom and noncentrality parameter λ*_B given by

$$\lambda_B^* = \sum_{i=1}^{p} w_{i\bullet}^* (\mu_i - \bar{\mu}_\bullet)^2, \quad (27)$$

where μ̄_• is the weighted average of the group mean effect sizes μ_1, μ_2, . . . , μ_p given by

$$\bar{\mu}_\bullet = \frac{\sum_{i=1}^{p} w_{i\bullet}^* \mu_i}{\sum_{i=1}^{p} w_{i\bullet}^*}. \quad (28)$$

The power of the test based on Q*_B at significance level α is therefore

$$1 - H(c_\alpha \mid p - 1; \lambda_B^*), \quad (29)$$

where H(x|ν; λ) is the cumulative distribution function of the noncentral chi-square with ν degrees of freedom and noncentrality parameter λ. This distribution is tabulated and is widely available in statistical software such as SAS, SPSS, SPlus, and STATA.

Choosing values for the noncentrality parameter. Using Equation 29 to compute the statistical power of the test of between-groups differences in effect size requires a value of λ*_B. If the studies have already been collected and data extracted from the studies, then the number of studies in each group and the sampling variances of each group mean effect size (which are derived from the sampling variances of the individual effect sizes) will be known. The only other elements required to compute λ*_B are the μ_i values. In some circumstances it may be possible to guess the μ_i values directly, so that Equation 28 can be used to obtain λ*_B. In other cases, particularly in the preliminary stages of a meta-analysis, it may be difficult to guess the μ_i values of interest (and even the v_i• values), so approaches to computing λ*_B that do not require guessing all of the μ_i values may be easier to use. We give two such approaches.

If p = 2, so that there are only two groups, then λ*_B reduces to

$$\lambda_B^* = \frac{(\mu_1 - \mu_2)^2}{v_{1\bullet}^* + v_{2\bullet}^*}, \quad (30)$$

so that values of λ*_B may be computed from the smallest difference among group mean effect sizes that is of substantive importance, along with the (mixed-model) variances of the group mean effect size estimates for each group.

If data are not available to compute v*_i•, it is necessary to guess these values to compute λ*_B and the power of the test. In doing so, an issue arises in the mixed-effects analysis that does not arise in the fixed-effects analysis. The noncentrality parameter λ*_B depends not only on the group mean effect size parameters and the conditional variances of the effect size estimates but also on the between-studies-within-groups variance component τ². Hedges and Pigott (2001) suggested a convention for describing heterogeneity based on the ratio between τ² and the typical sampling error variance v, with τ²/v = 1/3 corresponding to a small degree of heterogeneity, τ²/v = 2/3 corresponding to a moderate degree of heterogeneity, and τ²/v = 1 corresponding to a large degree of heterogeneity.

If the variances v*_ij of the individual effect size estimates are identical to v* or nearly so (which is equivalent to saying that all the v_ij are identical to v or nearly so), and if all groups have the same number, m, of effect sizes, then w*_i• = m/v* = m/(v + τ²), and one might treat λ*_B as m(p − 1)/v* = m(p − 1)/(v + τ²) times the "variance" of the μ_i values. Thus λ*_B is m(p − 1)v/(v + τ²) times the ratio of the "variance" between group mean effect sizes to the within-study variance. That is, the noncentrality parameter λ*_B is a constant times the ratio R of the variance of group mean effect sizes to the within-experiment variance (loosely, R = Variance(μ_i)/v). Using the conventions above, for a small degree of heterogeneity between studies but within groups λ*_B = 3Rm(p − 1)/4, for a medium degree of heterogeneity λ*_B = 3Rm(p − 1)/5, and for a large degree of heterogeneity λ*_B = Rm(p − 1)/2.

Note that we have described two different variance ratios above. One is the ratio τ²/v, the ratio of the between-studies-but-within-groups variance (τ²) to the within-study variance (v); this first ratio describes the degree of within-group heterogeneity of study results. The second is the ratio R = Variance(μ_i)/v, the ratio of the variance of group mean effect sizes [Variance(μ_i)] to the within-experiment variance (v); this second ratio, R, describes the degree of variation of the group mean effect sizes.

Comparing the noncentrality parameters that would be calculated with v_ij = v in the case of fixed- and mixed-effects procedures with the same degree of variation in the group mean effect sizes, we see that with a small degree of between-studies-within-group heterogeneity λ*_B = 3λ_B/4, with a moderate degree λ*_B = 3λ_B/5, and with a large degree λ*_B = λ_B/2. Because λ*_B is less than the corresponding λ_B, the power of mixed-effects procedures will be lower than that of the corresponding fixed-effects procedures.

Example of a mixed-effects omnibus test. We can use the results in Table 1 to compute the power of the α = .05 level test of differences in group mean effect sizes in a mixed-effects model. In this example, we examine the differences between the low-, middle-, and mixed-income groups. If we compute τ̂² from these data using Q_E = 5.56 + 52.70 + 15.11 = 73.37 and a = a_low + a_mid + a_mixed = 82.33 + 128.59 + 351.08 = 562.00, we obtain τ̂² = [Q_E − (k − p)]/a = [73.37 − (30 − 3)]/562.00 = 46.37/562.00 = 0.082. Adding τ̂² = 0.082 to the variance of each effect size in Table 1, we obtain the mixed-model estimates of the variances of the group mean effect sizes.

If we were interested only in detecting a difference of 0.25 between the low- and middle-income groups, we would use Equation 30 to compute the noncentrality parameter λ*_B = (μ_1 − μ_2)²/(v*_1• + v*_2•) = (0.25)²/(0.025 + 0.016) = 1.524. The 95% critical value for the central chi-square with 1 degree of freedom is 3.841. Using the cumulative distribution of the noncentral chi-square with 1 degree of freedom and noncentrality parameter equal to 1.524, the power of the omnibus test in the mixed-effects model is 1 − H(3.841|1; 1.524) = 1 − 0.77 = 0.23. We have low power to detect a difference of 0.25 between these two means with a variance component equal to 0.082. Note that the power of the mixed-effects test is even lower than that of the corresponding fixed-effects test computed earlier.

If we did not have either the variances for all of the individual effect sizes or a direct estimate of the between-studies variance component, we could use estimates of the average values and the conventions discussed above. We would assume that the variances of the individual effect sizes are identical or nearly so and that we have a similar number, m, of studies in each group. Using our three-group example in Table 1, we could posit m = 10, the average number of studies in the three groups. We might further assume (based on knowledge of research designs in the field or an estimate from a sample of studies) that the typical study had a sample size of 25 per group and therefore an effect size variance of about v = 0.08. Suppose that we desire the power to detect three group mean effect sizes of 0.000, 0.125, and 0.250 (a difference of 0.25 between the low- and middle-SES groups, with the mixed-SES group halfway in between). The variance of these three group mean effect sizes is about 0.016, so the ratio R = Variance(μ_i)/v = 0.016/0.08 = 0.20; in other words, the variance between group mean effect sizes is expected to be about 20% of the within-study variance.

Using the conventions on heterogeneity, if we have a small degree of heterogeneity, the noncentrality parameter λ*_B is equal to 3R(10)(3 − 1)/4 = 60R/4 = 15.0R = 15(0.20) = 3.0. The power of the test to detect a mean difference with a small degree of heterogeneity is 1 − H(5.991|2; 3.0) = 1 − 0.678 = 0.322. With a medium degree of heterogeneity, the noncentrality parameter is equal to 3(10)(3 − 1)R/5 = (12.0)(0.20) = 2.4, resulting in power to detect a mean difference of 1 − H(5.991|2; 2.4) = 1 − 0.736 = 0.264. A large degree of heterogeneity gives a noncentrality parameter equal to 10(3 − 1)R/2 = 10(0.20) = 2.0, and the power to detect a group mean difference is 1 − H(5.991|2; 2.0) = 1 − 0.77 = 0.23. Comparing these power values with the corresponding values computed in the fixed-effects analysis, we see that they are lower, as expected. It would be difficult to justify the use of a test with such low power, or to draw strong conclusions from it if the null hypothesis were not rejected.
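Both the variance-component estimate and the power values in this example are easy to reproduce in software. The following sketch (ours, not the authors'; it assumes NumPy and SciPy, and the function name and data layout, which mirror Table 1, are ours) implements the estimator of Equations 21 and 22 and then the mixed-model omnibus powers.

```python
# Sketch (ours): tau^2 by the method of moments (Eqs. 21-22) from the
# Table 1 data, then the mixed-model omnibus power for this example.
import numpy as np
from scipy.stats import chi2, ncx2

def tau_squared(groups):
    """groups: list of (effect sizes, conditional variances) per group."""
    q_e, a, k = 0.0, 0.0, 0
    for T, v in groups:
        T, w = np.asarray(T), 1.0 / np.asarray(v)
        tbar = np.sum(w * T) / np.sum(w)              # fixed-effects group mean
        q_e += np.sum(w * (T - tbar) ** 2)            # contribution to Q_E (Eq. 16)
        a += np.sum(w) - np.sum(w ** 2) / np.sum(w)   # a_i (Eq. 22)
        k += len(T)
    return max(0.0, (q_e - (k - len(groups))) / a)    # Eq. 21

low = ([0.72, 0.63, 0.73, 0.07, 1.19, 0.47],
       [0.03, 0.04, 0.04, 0.11, 0.15, 0.14])
mid = ([0.27, -0.11, 0.00, 0.53, 0.14, -0.07, 0.39, 0.16, 0.53, 2.27],
       [0.06, 0.05, 0.17, 0.07, 0.17, 0.17, 0.03, 0.11, 0.04, 0.08])
mixed = ([0.51, 0.25, 0.38, 0.60, 0.91, 0.36, 0.12, 0.03, 0.20, 0.60,
          0.21, 0.24, 0.50, 0.76],
         [0.03, 0.01, 0.01, 0.02, 0.06, 0.06, 0.11, 0.07, 0.04, 0.11,
          0.13, 0.10, 0.10, 0.09])
print(tau_squared([low, mid, mixed]))                 # ~0.082

lam = 0.25 ** 2 / (0.025 + 0.016)                     # Eq. 30: ~1.524
print(ncx2.sf(chi2.ppf(0.95, 1), 1, lam))             # two-group power: ~0.23
for lam in (3.0, 2.4, 2.0):                           # small/medium/large heterogeneity
    print(ncx2.sf(chi2.ppf(0.95, 2), 2, lam))         # ~0.32, ~0.26, ~0.23
```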

Contrasts Among Group Mean Effect Sizes

Just as contrasts are used to explore differences among group means in ANOVA and fixed-effects meta-analysis, contrasts can be used to explore group mean effect sizes in mixed-model meta-analysis. In the latter case a contrast parameter is a linear combination of group mean effect sizes of the form

$$\gamma^* = c_1 \mu_1 + c_2 \mu_2 + \cdots + c_p \mu_p, \quad (31)$$

where the coefficients c_1, c_2, . . . , c_p are known coefficients that satisfy the constraint c_1 + c_2 + . . . + c_p = 0 and are chosen to reflect a particular comparison or pattern of interest. The contrast parameter given in Equation 31 is usually estimated by the sample contrast

$$G^* = c_1 \bar{T}_{1\bullet}^* + c_2 \bar{T}_{2\bullet}^* + \cdots + c_p \bar{T}_{p\bullet}^*, \quad (32)$$

which has a normal sampling distribution with mean γ* and variance

$$v_G^* = c_1^2 v_{1\bullet}^* + c_2^2 v_{2\bullet}^* + \cdots + c_p^2 v_{p\bullet}^*. \quad (33)$$

Because G* has a normal distribution with known variance, statistical tests of the hypothesis that γ* = 0 can be carried out using the test statistic Z*_G = G*/√v*_G, which has the standard normal distribution when the null hypothesis is true. When the null hypothesis is false, Z*_G has a normal distribution with mean γ*/√v*_G and variance 1.

Planned comparisons. The one-sided test for a planned comparison at significance level α is carried out by rejecting the null hypothesis that γ* = 0 if Z*_G > c_α, where c_α is the 100(1 − α) percent point of the standard normal distribution (e.g., c_α = 1.645 for α = .05). Because the one-tailed test that γ* = 0 at level α rejects when Z*_G > c_α, the power of the one-tailed test is given by

$$1 - \Phi(c_\alpha - \gamma^*/\sqrt{v_G^*}), \quad (34)$$

where Φ(x) is the standard normal cumulative distribution function. Computation of power of the two-sided test is slightly more complicated. Because the two-sided test of the hypothesis that γ* = 0 at level α rejects when |Z*_G| > c_{α/2}, that is, if Z*_G > c_{α/2} or if Z*_G < −c_{α/2}, the power of the two-sided test is given by

$$1 - \Phi(c_{\alpha/2} - \gamma^*/\sqrt{v_G^*}) + \Phi(-c_{\alpha/2} - \gamma^*/\sqrt{v_G^*}). \quad (35)$$

Unplanned or post hoc comparisons. Procedures for estimating the power of two-tailed unplanned (post hoc) comparisons are similar to those in the fixed-effects case. The generalization of the Bonferroni procedure controls the familywise Type I error rate of l comparisons to be less than α. In this procedure the significance level α used to obtain the critical value is replaced by α/2l. Thus, the power of the two-tailed test is computed exactly as in Equation 35, except that c_{α/2} is replaced by c_{α/2l}. The generalization of the Scheffé procedure controls the familywise Type I error rate of any number of contrasts to be less than α. In this procedure the critical value c_{α/2} is replaced by √C(p − 1; α), where C(p − 1; α) is the 100(1 − α) percentile point of the chi-square distribution with p − 1 degrees of freedom. Thus, the power of the two-tailed test is computed exactly as in Equation 35, except that c_{α/2} is replaced by √C(p − 1; α).

Conventions for heterogeneity. Computation of v*_G requires both the conditional variances v_ij in each group and the between-studies variance component τ². As in the case of the omnibus test above, we adopt the convention, with a common value of v in each study, that τ² = v/3 is a small degree of heterogeneity, τ² = 2v/3 is a moderate degree of heterogeneity, and τ² = v is a large degree of heterogeneity. With these conventions for heterogeneity, v*_i• = (v + τ²)/m_i.

Example of planned and post hoc comparisons. Return to the phonics instruction data in Table 1. As in the fixed-effects model example, we can pose a value of the contrast parameter of γ* = 0.25 for the comparison of the mean effect size of studies with low-income children versus the average of the mean effect sizes for the remaining two groups of studies. With a value of τ̂² = 0.082, the variance of the contrast is equal to

v*_G = (1)²(v*_low•) + (1/2)²(v*_mid•) + (1/2)²(v*_mixed•) = 0.025 + 0.016/4 + 0.010/4 = 0.032.

For the one-tailed test, the 95% critical value of the standard normal distribution is 1.645. The value of the standard normal cumulative distribution function at c_α − γ*/√v*_G = 1.645 − 0.25/√0.032 = 0.247 is 0.60 when α = .05. Thus, the power for the one-tailed test of the planned comparison is 1 − 0.60 = 0.40. For a two-sided test of the planned comparison, we need two values of the standard normal cumulative distribution. The 95% critical values for a two-tailed test using the standard normal distribution are 1.96 and −1.96. We need the value of the standard normal cumulative distribution at 1.96 − 0.25/√0.032 = 0.56 and at −1.96 − 0.25/√0.032 = −3.36. The power of the two-sided test is 1 − Φ(0.56) + Φ(−3.36) = 1 − 0.71 + 0.00 = 0.29. The value of the contrast in Table 1 using the computed mean effect sizes is 0.24, so the parameter value in this example is consistent with estimates from the data in Table 1.

We can compute the power of post hoc comparisons using either the Bonferroni or the Scheffé procedure. Let us assume we wish to compare the effect sizes from the three groups of studies and we have two statistical comparisons in mind, making l = 2. The significance level for the simultaneous test of two comparisons is given by α/(2l) = 0.05/4 = 0.0125. The critical values of the two-sided test of the standard normal distribution are −2.24 and 2.24 for α = .0125. If we are interested in detecting a comparison with a value of 0.25 and the same variance as we computed above, we find c_{α/2l} − γ*/√v*_G = 2.24 − 0.25/√0.032 = 0.84 and −c_{α/2l} − γ*/√v*_G = −2.24 − 0.25/√0.032 = −3.64. The power of the two-sided test is 1 − Φ(0.84) + Φ(−3.64) = 1 − 0.80 + 0.00 = 0.20.

The Scheffé procedure provides the power for any number of contrasts. Let us assume that one of the contrasts we will compute has a value of 0.25 and variance of 0.032 as in our prior examples. In this example, we have p = 3, so we need the critical value of the chi-square distribution at α = .05, or C(3 − 1; 0.05). The critical value of the chi-square distribution with 2 degrees of freedom and α = .05 is 5.991. We replace c_{α/2} with √C(2; 0.05) = √5.991 = 2.448 in Equation 35. We obtain the power of the two-sided test by calculating √C(2; 0.05) − γ*/√v*_G = 2.448 − 0.25/√0.032 = 1.05 and −√C(2; 0.05) − γ*/√v*_G = −2.448 − 0.25/√0.032 = −3.84. Thus, the power is 1 − Φ(1.05) + Φ(−3.84) = 1 − 0.85 + 0.00 = 0.15, somewhat less than the power using the Bonferroni procedure.
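For completeness, the two planned-comparison power values in this example can be checked with a few lines (ours, assuming SciPy):

```python
# Sketch (ours): mixed-model planned-comparison power, Eqs. 34-35,
# with the contrast variance v*_G = 0.032 computed above.
from math import sqrt
from scipy.stats import norm

shift = 0.25 / sqrt(0.032)                    # gamma* / sqrt(v*_G), ~1.40
print(1 - norm.cdf(1.645 - shift))            # one-tailed: ~0.40
print(1 - norm.cdf(1.96 - shift) + norm.cdf(-1.96 - shift))  # two-sided: ~0.29
```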

Tests of the Between-Studies Variance Component The test that ␶2 ⫽ 0 in the mixed model is the same as the test of homogeneity using QE in the fixed-effects model. The reason is that if ␶2 ⫽ 0, then for each group (say, the ith group) ␪i1 ⫽ ␪i2 ⫽ . . . ⫽ ␪imi ⫽ ␮i, and the fixed-effects null hypothesis is true. However the non-null distribution of the QE statistics differs in the fixed and mixed models. This is analogous to the situation in one-way ANOVA where the null distributions of the F ratios are the same in the fixedeffects and random effects models but the non-null distributions differ. Under the mixed-model assumptions, as in the fixedeffects case, the test statistic QE has a chi-square distribution with k ⫺ p degrees of freedom when the null hypothesis that ␶2 ⫽ 0 is true. When the null hypothesis of homogeneity is false, that is, when ␶2 ⬎ 0, and the conditional variances are all equal, that is, v11 ⫽ v12 ⫽ . . . ⫽ v1m1 ⫽ v21 ⫽ . . . ⫽ vpmp ⫽ v, then QE has a distribution that is (v


1 − H[cαv/(v + τ²) | k − p; 0],    (36)

where H(x|ν; 0) is the cumulative distribution of the central chi-square distribution with ν degrees of freedom. When the conditional variances are unequal, QE has a distribution of rather complex form (a weighted sum of chi-square distributions) that is not tabulated. However, an approximation to that distribution that is adequate for estimating statistical power is known (Satterthwaite, 1946). It approximates the distribution of QE by a gamma distribution with mean and variance equal to those of QE. The mean μQ under this model is

μQ = aτ² + (k − p),    (37)

and the variance of QE is given by

σQ² = 2b,    (38)

where a is given by Equation 22, b = b1 + . . . + bp, and bi is given by



b i ⫽ 共m i ⫺ 1兲 ⫹ 2 S1 i ⫺



S2 i 2 ␶ S1 i



⫹ S2 i ⫺ 2



S3 i 共S2 i兲 2 4 ⫹ ␶, S1 i 共S1 i兲 2

(39)

and S1i, S2i, and S3i are the sums of the (fixed-effects) weights, squared weights, and cubed weights in group i, respectively, given by

S1i = wi1 + wi2 + · · · + wimi,    (40)

S2i = wi1² + wi2² + · · · + wimi²,    (41)

and

S3i = wi1³ + wi2³ + · · · + wimi³.    (42)

If we use the chi-square distribution with noninteger degrees of freedom to evaluate the gamma distribution, the power of the test that τ² = 0 is given by

1 − H(cα/r | s; 0),    (43)

where H(x|s; 0) is the cumulative distribution function of the central chi-square distribution with s degrees of freedom, cα is the 100(1 − α) percentile point of the chi-square distribution with (k − p) degrees of freedom,

r = σQ²/(2μQ),    (44)

and

s = 2(μQ)²/σQ².    (45)

Choosing values of τ². To use Equation 43 to compute power, one must insert a value of τ² into Equation 37. In some applications a plausible value of τ² may be suggested by previous research or other considerations. In the absence of such specific considerations, we mentioned previously in this article conventions suggested by Hedges and Pigott (2001) for the size of τ² in terms of the typical sampling variance v. Specifically, they suggested that τ² = v/3 is a small degree of heterogeneity, τ² = 2v/3 is a medium degree of heterogeneity, and τ² = v is a large degree of heterogeneity. When all studies have the same sampling error variance v, these conventions for small, moderate, and large heterogeneity correspond to situations in which τ² contributes 25%, 40%, and 50%, respectively, of the total variance of the effect size estimates.

Example computing the power of the test for the between-studies variance component. Return to the phonics instruction data in Table 1 that were used for the omnibus test of mixed effects, and suppose that we wish to compute the power of the α = .05 level test that τ² = 0. The average sampling error variance is v = 0.08, but we might also have arrived at this value if we hypothesized that the typical sample size was 25 per group. A small degree of heterogeneity therefore corresponds to τ² = 0.08/3 = 0.027. A medium degree of heterogeneity corresponds to τ² = 0.08(2/3) = 0.053, and a large degree of between-studies heterogeneity corresponds to τ² = 0.08. The critical value of the chi-square distribution with 30 − 3 = 27 degrees of freedom is 40.11. We can compute a = 562.00 as in the earlier example of the omnibus test for mixed effects. The constant b depends on the true value of the variance component τ².

For τ² = 0.027 (small heterogeneity), b = blow + bmiddle + bmixed = 10.757 + 17.690 + 45.618 = 74.066. We can then compute μQ = aτ² + (k − p) = (562.0)(0.027) + (30 − 3) = 42.174 and σQ² = 2b = 2(74.066) = 145.130. We compute the constants r and s as r = σQ²/[2(μQ)] = 145.130/[2(42.174)] = 1.721 and s = 2(42.174)²/145.130 = 24.511. Using Equation 43, the power to detect heterogeneity of τ² = 0.027 is p = 1 − H(40.11/1.721 | 24.511; 0) = 1 − H(23.306 | 24.511; 0) = 0.53.

For τ² = 0.053 (medium heterogeneity), b = blow + bmiddle + bmixed = 18.779 + 29.358 + 102.850 = 150.987. We can then compute μQ = aτ² + (k − p) = (562.0)(0.053) + (30 − 3) = 56.786 and σQ² = 2b = 2(150.987) = 301.973. We compute the constants r and s as r = σQ²/[2(μQ)] = 301.973/[2(56.786)] = 2.659 and s = 2(56.786)²/301.973 = 21.357. Using Equation 43, the power to detect heterogeneity of τ² = 0.053 is p = 1 − H(40.11/2.659 | 21.357; 0) = 1 − H(15.085 | 21.357; 0) = 0.83.


For τ² = 0.08 (large heterogeneity), b = blow + bmiddle + bmixed = 29.68 + 44.90 + 189.10 = 263.68. We can then compute μQ = aτ² + (k − p) = (562.0)(0.08) + (30 − 3) = 71.96 and σQ² = 2b = 2(263.68) = 527.36. We compute the constants r and s as r = σQ²/[2(μQ)] = 527.360/[2(71.960)] = 3.664 and s = 2(71.960)²/527.360 = 19.638. Using Equation 43, the power to detect heterogeneity of τ² = 0.08 is p = 1 − H(40.11/3.664 | 19.638; 0) = 1 − H(10.947 | 19.638; 0) = 0.94.

If we had not known the conditional variances of the individual effect sizes, we might have assumed these variances to be approximately equal and used Equation 36 to compute power. If we had known (or estimated) that the average sample size was 25 per group, we would have computed that the average fixed-effects variance was approximately 0.08. If we do not know the value of τ², we could apply our convention discussed previously. The argument of the function on the right-hand side of Equation 36 for a small degree of heterogeneity is given by cαv/(v + τ²) = 40.11 × 0.08/(0.08 + 0.027) = 29.99. The power of the test is 1 − H(29.99|27; 0) = 1 − 0.69 = 0.31. For a medium degree of heterogeneity, the argument is cαv/(v + τ²) = 40.11 × 0.08/(0.08 + 0.053) = 24.13, and our power is 1 − H(24.13|27; 0) = 1 − 0.38 = 0.62. A large degree of heterogeneity yields the argument cαv/(v + τ²) = 40.11 × 0.08/(0.08 + 0.08) = 20.06, and thus power of 1 − H(20.06|27; 0) = 1 − 0.17 = 0.83.
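A short script can reproduce the Satterthwaite calculations above. The following is a rough sketch under our own naming (not the authors' code), assuming the constants a and b have already been obtained from the fixed-effects weights; a helper for Equation 39 is included, and b must be computed at the same τ² that is being tested.

```python
# Power of the test that tau^2 = 0 via the gamma approximation.
from scipy.stats import chi2

def group_b(weights, tau2):
    """b_i for one group of studies (Equation 39)."""
    m = len(weights)
    s1 = sum(weights)
    s2 = sum(w**2 for w in weights)
    s3 = sum(w**3 for w in weights)
    return ((m - 1) + 2 * (s1 - s2 / s1) * tau2
            + (s2 - 2 * s3 / s1 + s2**2 / s1**2) * tau2**2)

def variance_component_power(a, b, tau2, k, p, alpha=0.05):
    """Power of the test that tau^2 = 0 (Equation 43)."""
    df = k - p
    c_alpha = chi2.ppf(1 - alpha, df)   # critical value of Q_E
    mu_q = a * tau2 + df                # Equation 37
    sigma2_q = 2 * b                    # Equation 38
    r = sigma2_q / (2 * mu_q)           # Equation 44
    s = 2 * mu_q**2 / sigma2_q          # Equation 45
    return 1 - chi2.cdf(c_alpha / r, s)

# Medium heterogeneity in the example: a = 562.0, b = 150.987
print(variance_component_power(562.0, 150.987, 0.053, k=30, p=3))  # ~0.83
```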

Multiple Regression Analogues

Analogues to multiple regression analysis for effect sizes are often used in moderator analyses when moderator variables are continuous or there are multiple moderator variables incorporated in the same analysis. We describe first analogues to fixed-effects regression analysis and then the corresponding procedures that are analogues to mixed-effects regression. In the latter case, the effects of the study-level moderator variables are taken to be fixed effects, but study-specific effects leading to excess residual variation are taken to be random effects. The principal difference between these two procedures is in the calculation of the variance of the regression coefficients. In fixed-effects procedures, between-studies variation within groups of studies having the same covariate values does not influence the uncertainty of regression coefficient estimates. In mixed-model procedures, such variation does contribute to the uncertainty of regression coefficient estimates.


Fixed-Effects Procedures

Let T1, T2, . . . , Tk be estimates of the effect size parameters θ1, θ2, . . . , θk from k independent studies having known standard errors √v1, √v2, . . . , √vk. Suppose that each effect size estimate is approximately normally distributed so that

Ti ~ N(θi, vi).

Suppose further that the effect size parameters are determined by p moderator variables X1, X2, . . . , Xp, via

θi = β0 + β1xi1 + β2xi2 + · · · + βpxip,    (46)

where xij is the value of the ith study on the jth moderator variable and β0, . . . , βp are unknown regression coefficients. Alternatively, we might say that

Ti = β0 + β1xi1 + β2xi2 + · · · + βpxip + εi,    (47)

where εi ≡ Ti − θi and εi ~ N(0, vi). It is easiest to express the model and estimates in matrix form in terms of a k × 1 vector of effect size estimates T = (T1, . . . , Tk)′, a k × 1 vector of residuals ε = (ε1, . . . , εk)′, a (p + 1) × 1 vector of regression coefficients β = (β0, . . . , βp)′, and the k × (p + 1) design matrix X (assumed to be of full rank) whose first column is a vector of ones and whose other elements are xij. In this notation the model given in Equation 47 becomes T = Xβ + ε, where ε has a k-variate normal distribution with mean 0 and covariance matrix V = Diag(v1, . . . , vk). The estimated regression coefficients are given by

β̂ = (X′V⁻¹X)⁻¹X′V⁻¹T,    (48)

which has a (p + 1) variate normal distribution with mean vector β and covariance matrix

Σ = (X′V⁻¹X)⁻¹.
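In code, the weighted least squares computation of Equation 48 is a few lines of linear algebra. The sketch below is our illustration, not code from the article; it assumes T, v, and X are supplied as arrays (e.g., the columns of Table 2 below), and given those inputs it should reproduce the Σ used in the omnibus example that follows.

```python
# Fixed-effects (weighted least squares) meta-regression, Equation 48.
import numpy as np

def fixed_effects_fit(T, v, X):
    """Return beta_hat and its covariance Sigma = (X'V^-1 X)^-1."""
    T, v, X = np.asarray(T), np.asarray(v), np.asarray(X)
    V_inv = np.diag(1.0 / v)                 # V^-1 with V = Diag(v1, ..., vk)
    Sigma = np.linalg.inv(X.T @ V_inv @ X)   # covariance matrix of beta_hat
    beta_hat = Sigma @ X.T @ V_inv @ T       # Equation 48
    return beta_hat, Sigma
```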

Omnibus Tests

It is frequently desirable to test the null hypothesis that the last q (q ≤ p) of the regression coefficients are simultaneously 0, that is, that βp−q+1 = βp−q+2 = . . . = βp = 0. Such a test arises, for example, when it is desired to test whether the entire group of moderator variables (when q = p) or a subset of the moderator variables controlling for others (when q < p) are related to the effect sizes.


Table 2
The Effects of Phonics Programs Under Various Conditions

Study   Effect size (Ti)   Effect size variance (vi)   Grade          Reading ability   Program         X matrix
 1          −0.17                 0.03                  Kindergarten   At risk           Synthetic       1 0 1 0
 2           0.38                 0.03                  Kindergarten   Normal            Synthetic       1 0 0 0
 3           0.23                 0.01                  First grade    Normal            Synthetic       1 1 0 0
 4           0.02                 0.09                  First grade    At risk           Synthetic       1 1 1 0
 5           1.92                 0.07                  First grade    Normal            Synthetic       1 1 0 0
 6           1.63                 0.08                  First grade    At risk           Synthetic       1 1 1 0
 7           0.56                 0.06                  First grade    At risk           Large unit      1 1 1 1
 8          −1.11                 0.19                  First grade    Normal            Miscellaneous   1 1 0 1
 9           0.20                 0.17                  First grade    Normal            Miscellaneous   1 1 0 1
10          −0.08                 0.17                  First grade    Normal            Miscellaneous   1 1 0 1
11           0.53                 0.16                  Kindergarten   At risk           Synthetic       1 0 1 0
12           0.93                 0.09                  First grade    At risk           Large unit      1 1 1 1
13           0.56                 0.04                  Kindergarten   At risk           Synthetic       1 0 1 0
14           0.08                 0.06                  Kindergarten   At risk           Synthetic       1 0 1 0
15           0.52                 0.06                  Kindergarten   At risk           Large unit      1 0 1 1
16           0.07                 0.11                  First grade    At risk           Synthetic       1 1 1 0
17           2.94                 0.13                  First grade    At risk           Large unit      1 1 1 1
18           1.30                 0.16                  First grade    At risk           Synthetic       1 1 1 0
19           0.04                 0.14                  Kindergarten   At risk           Miscellaneous   1 0 1 1

Note. These data are from Ehri, Nunes, Stahl, and Willows (2001).

Such a test uses the test statistic

QR = β̂q′Σ22⁻¹β̂q,    (49)

where β̂q = (β̂p−q+1, β̂p−q+2, . . . , β̂p)′ and Σ22⁻¹ is the q × q matrix obtained by partitioning Σ = (X′V⁻¹X)⁻¹ as follows:

Σ = (X′V⁻¹X)⁻¹ = [ Σ11   Σ12
                   Σ12′  Σ22 ].

The statistic QR has the chi-square distribution with q degrees of freedom when the null hypothesis is true. When the null hypothesis is false, QR has the noncentral chi-square distribution with q degrees of freedom and noncentrality parameter

λR = βq′Σ22⁻¹βq,    (50)

where βq = (βp−q+1, βp−q+2, . . . , βp)′. The power of the test based on QR at significance level α is therefore

1 − H(cα | q; λR),    (51)

where H(x|ν; λ) is the cumulative distribution function of the noncentral chi-square with ν degrees of freedom and noncentrality parameter λ.

We next present an example of the fixed-effects omnibus test. The effect sizes and study characteristics from 19 studies of systematic phonics instruction examined by Ehri et al. (2001) are given in Table 2. The data include three predictors that might influence effect size: grade level (kindergarten or Grade 1), reading ability of students at pretest (normal or at risk), and type of phonics program (synthetic or other). We dummy coded each of these predictors so that kindergarten, normal ability, and synthetic programs were the reference categories. The last four columns of Table 2 are the design matrix X. Previous theory suggested that phonics programs would be more effective at Grade 1 by as much as 0.25 standard deviations and that they would be more effective for at-risk students by about the same amount but that there would be little difference between the types of phonics instruction. Comparing the sample effect sizes by values of each predictor in turn yields differences that are roughly consistent with these expectations (e.g., differences of 0.441, 0.426, and 0.095). Thus, we evaluate the power to detect βq = (0.25, 0.25, 0.0)′. We compute the value of Σ as

Σ = [ Σ11   Σ12     = [  0.015  −0.012  −0.011   0.001
      Σ12′  Σ22 ]       −0.012   0.015   0.007  −0.004
                        −0.011   0.007   0.016  −0.006
                         0.001  −0.004  −0.006   0.019 ].

We compute λR as


λR = βq′Σ22⁻¹βq with βq = (0.25, 0.25, 0.00)′ and

Σ22 = [  0.036 →  0.015   0.007  −0.004
         0.007   0.016  −0.006
        −0.004  −0.006   0.019 ],

which yields λR = 6.291.


The 95% point of the central chi-square distribution with 3 degrees of freedom is 7.815. The power of the omnibus test is therefore 1 − H(7.815|3; 6.291) = 1 − 0.46 = 0.54.
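This omnibus power computation can be scripted directly; scipy's ncx2 provides the noncentral chi-square distribution. The following is a sketch under our own naming, with Σ22 taken from the example above.

```python
# Power of the fixed-effects omnibus test, Equations 50 and 51.
import numpy as np
from scipy.stats import chi2, ncx2

def omnibus_power(beta_q, Sigma22, alpha=0.05):
    """Power of the test that the last q regression coefficients are 0."""
    beta_q = np.asarray(beta_q)
    lam = beta_q @ np.linalg.inv(Sigma22) @ beta_q       # Equation 50
    q = len(beta_q)
    return 1 - ncx2.cdf(chi2.ppf(1 - alpha, q), q, lam)  # Equation 51

Sigma22 = np.array([[ 0.015,  0.007, -0.004],
                    [ 0.007,  0.016, -0.006],
                    [-0.004, -0.006,  0.019]])
print(omnibus_power([0.25, 0.25, 0.0], Sigma22))  # ~0.54
```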

Tests of Individual Regression Coefficients

The test of the hypothesis that βj = 0 uses the test statistic Zj = β̂j/√σjj, where σjj is the variance (square of the standard error) of β̂j, given by the jth diagonal element of the matrix Σ = (X′V⁻¹X)⁻¹. The statistic Zj is normally distributed with mean βj/√σjj (which, of course, is 0 if the null hypothesis that βj = 0 is true) and variance 1. The one-sided test of the hypothesis that βj = 0 at level α rejects if

Zj = β̂j/√σjj > cα,

where cα is the 100(1 − α) percent critical value of the standard normal distribution. Therefore, the power of the one-sided test is given by

1 − Φ(cα − βj/√σjj),    (52)

where Φ(x) is the standard normal cumulative distribution function. Computation of power of the two-sided test is slightly more complicated. Because the two-sided test of the hypothesis that βj = 0 at level α rejects when |Zj| > cα/2, that is, if Zj > cα/2 or if Zj < −cα/2, the power of the two-sided test is given by

1 − Φ(cα/2 − βj/√σjj) + Φ(−cα/2 − βj/√σjj).    (53)

We next present an example of tests of individual regression coefficients. Return to the phonics instruction data in Table 2, and consider the power of the α = .05 level test that the individual regression coefficient for the effect of grade is zero (i.e., βgrade = β1 = 0). As in the case of the omnibus test discussed earlier, we compute the power to detect a true effect of β1 = 0.25 (indicating that the difference in effect between kindergarten and first-grade children is about one fourth of a standard deviation) when there is moderate heterogeneity. Let us take α = .05, and the estimated variance of β̂grade as 0.015, the second diagonal element of Σ. The power of the one-sided test is given by 1 − Φ(cα − βj/√σjj) = 1 − Φ[1.645 − (0.25)/√0.015] = 1 − Φ(1.645 − 2.041) = 1 − Φ(−0.396) = 1 − 0.35 = 0.65. For the two-sided test with α = .05, we have power equal to 1 − Φ(cα/2 − βj/√σjj) + Φ(−cα/2 − βj/√σjj) = 1 − Φ[1.96 − (0.25)/√0.015] + Φ[−1.96 − (0.25)/√0.015] = 1 − Φ(−0.081) + Φ(−4.00) = 1 − 0.47 + 0.00 = 0.53. As expected, the two-sided test has less power than the one-sided test.
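The z-test power formulas above reduce to a few lines of code. This is a minimal sketch under hypothetical names, assuming σjj has been extracted from Σ as in the example.

```python
# Power of the z test for one regression coefficient, Equations 52-53.
from math import sqrt
from scipy.stats import norm

def coefficient_power(beta_j, sigma_jj, alpha=0.05, tails=1):
    """Power of the test that beta_j = 0, given its variance sigma_jj."""
    z = beta_j / sqrt(sigma_jj)
    if tails == 1:
        return 1 - norm.cdf(norm.ppf(1 - alpha) - z)   # Equation 52
    c = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(c - z) + norm.cdf(-c - z)      # Equation 53

print(coefficient_power(0.25, 0.015, tails=1))  # ~0.65
print(coefficient_power(0.25, 0.015, tails=2))  # ~0.53
```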

Tests of Goodness of Fit of the Regression Model

The weighted sum of squares about the regression line is often used as a test of goodness of fit of the fixed-effects regression model for effect sizes. Formally, this is a test of the hypothesis that the model given in Equation 46 holds versus the alternative that at least one of the θi is not determined by the linear model. The test uses the statistic

QE = T′(V⁻¹ − V⁻¹X(X′V⁻¹X)⁻¹X′V⁻¹)T,    (54)

which has the chi-square distribution with (k − p − 1) degrees of freedom if the model in Equation 46 holds. When the model does not hold, QE has the noncentral chi-square distribution with (k − p − 1) degrees of freedom and noncentrality parameter

λE = θ′(V⁻¹ − V⁻¹X(X′V⁻¹X)⁻¹X′V⁻¹)θ,    (55)

where θ = (θ1, θ2, . . . , θk)′ is the k × 1 vector of effect size parameters. The power of the test based on QE at significance level α is therefore

1 − H(cα | k − p − 1; λE),    (56)

where H(x|ν; λ) is the cumulative distribution function of the noncentral chi-square with ν degrees of freedom and noncentrality parameter λ.

Choosing values for the noncentrality parameter. The statistical power of the test of goodness of fit of the regression model requires the value of λE, the noncentrality parameter of substantive interest. This depends on the design matrix X, the sampling (conditional) variance of each study (via the covariance matrix V), and the effect size parameters of each study, which may be difficult to guess in the preliminary stages of a meta-analytic study. If the variances vi of the individual effect size estimates are identical to v or nearly so, then one might treat the expression for λE as (k − p − 1)/v times the "variance" of the θi values about the regression plane, and thus λE is (k − p − 1) times the ratio of the excess "variance" to the within-study variance. The conventions suggested by Hedges and Pigott (2001), and that we have also used in this article, would correspond to λE = (k − p − 1)/3 as a small degree of heterogeneity, λE = 2(k − p − 1)/3 as a moderate degree of heterogeneity, and λE = (k − p − 1) as a large degree of heterogeneity.

Example of goodness-of-fit test for regression models. Returning to the phonics instruction data in Table 2, we have k = 19 studies and p = 3 predictors, and we compute the power of the test for goodness of fit at the α = .05 significance level. With a small degree of heterogeneity, we have a noncentrality parameter equal to λE = (19 − 4)/3 = 5.0. The 95% critical value for the central chi-square with 15 degrees of freedom is 25.0. The power of the goodness-of-fit test with a small degree of heterogeneity is given by 1 − H(cα | 19 − 3 − 1; 5.00) = 1 − H(25.0 | 15; 5.00) = 1 − 0.78 = 0.22. Thus, there is very little power to detect model misfit with a small degree of heterogeneity.
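A sketch of this calculation, using the heterogeneity conventions above to set λE; the function name and the string labels for the conventions are our own invention.

```python
# Power of the Q_E goodness-of-fit test, Equation 56.
from scipy.stats import chi2, ncx2

def gof_power(k, p, heterogeneity="small", alpha=0.05):
    """Power of the goodness-of-fit test with a conventional lambda_E."""
    df = k - p - 1
    lam = {"small": df / 3, "moderate": 2 * df / 3, "large": df}[heterogeneity]
    return 1 - ncx2.cdf(chi2.ppf(1 - alpha, df), df, lam)

print(gof_power(19, 3, "small"))     # ~0.22
print(gof_power(19, 3, "moderate"))  # larger than with small heterogeneity
```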

Mixed-Model Procedures

Let T1, T2, . . . , Tk be estimates of the effect size parameters θ1, θ2, . . . , θk from k independent studies having known standard errors √v1, √v2, . . . , √vk, as in the fixed-effects model, and suppose that each effect size estimate is approximately normally distributed so that the conditional distribution of Ti given θi is

Ti ~ N(θi, vi).

Suppose further that the effect size parameters are linearly related to p moderator variables X1, X2, . . . , Xp, but unlike the fixed-effects case, in the mixed-model case the moderator variables do not explain all of the variation in the effect size parameters but are related via

θi = β0 + β1xi1 + β2xi2 + · · · + βpxip + ξi,    (57)

where xij is the value of the ith study on the jth moderator variable, β0, . . . , βp are unknown regression coefficients, and ξi is a study-specific random effect with expectation 0 and variance τ². Alternatively, we might say that

Ti = β0 + β1xi1 + β2xi2 + · · · + βpxip + ηi,    (58)

where ηi = ξi + εi is a combined random component, with ξi ~ N(0, τ²), εi ≡ Ti − θi, εi ~ N(0, vi), and therefore ηi ~ N(0, τ² + vi). Thus, one might view τ² as quantifying the amount of excess residual variation over and above sampling error variation of Ti about θi. For this reason τ² is often called the residual variance component. The regression coefficients in the model given in Equation 58 are usually estimated by considering Ti to have residual variation

v*i = vi + τ².    (59)

Methods of estimation in mixed-effects statistical procedures for meta-analysis typically involve estimating τ² and then computing weights that are the reciprocals of the v*i. The methods usually used to estimate τ² are method of moments estimators analogous to those used to estimate variance components in the ANOVA (see, e.g., Hedges, 1992; Raudenbush, 1994).

As in the case of fixed-effects models, it is easiest to express the model and estimates in matrix form in terms of the k × 1 vector of effect size estimates T = (T1, . . . , Tk)′, the k × 1 vector of residuals η = (η1, . . . , ηk)′, a (p + 1) × 1 vector of regression coefficients β = (β0, . . . , βp)′, and the k × (p + 1) design matrix X (assumed to be of full rank) whose first column is a vector of ones and whose other elements are xij. In this notation the model given in Equation 58 becomes

T = Xβ + η,    (60)

where η has a k-variate normal distribution with mean 0 and covariance matrix V* = Diag(v*1, . . . , v*k). Whereas the vi are known, τ² is generally unknown and must be estimated from the data. Estimation of τ² uses the same principles as estimation of variance components in ANOVA. A direct argument gives the following unbiased estimator of τ²:

τ̂² = [QE − (k − p − 1)]/a,    (61)

where QE is the weighted residual sum of squares from the fixed-effects analysis given in Equation 54 and a is a constant given by

a = (w1 + w2 + · · · + wk) − tr[(X′V⁻¹X)⁻¹X′V⁻²X],    (62)

where the wi are the fixed-effects weights, V = Diag(v1, . . . , vk) is the conditional covariance matrix of the effect size estimates given in the section on fixed-effects models, and tr(Z) is the trace of a square matrix Z. The mixed-model estimates of β are computed using the covariance matrix V* of η, namely,

V* = Diag(v1 + τ̂², v2 + τ̂², . . . , vk + τ̂²).    (63)

The estimated regression coefficients are given by

β̂* = [X′(V*)⁻¹X]⁻¹X′(V*)⁻¹T,    (64)

which has a (p + 1) variate normal distribution with mean vector β and covariance matrix

Σ* = [X′(V*)⁻¹X]⁻¹.
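The moment estimator in Equations 61 and 62 is also straightforward to compute. A minimal sketch, assuming T, v, and X as in the fixed-effects code above; with the Table 2 data it should reproduce the values QE = 126.02, a = 214.84, and τ̂² ≈ 0.52 reported in the omnibus example below.

```python
# Method-of-moments estimate of the residual variance component tau^2.
import numpy as np

def estimate_tau2(T, v, X):
    """Unbiased estimator of tau^2 from Equations 54, 61, and 62."""
    T, v, X = np.asarray(T), np.asarray(v), np.asarray(X)
    k = X.shape[0]
    p_plus_1 = X.shape[1]            # X includes the intercept column
    V_inv = np.diag(1.0 / v)
    S = np.linalg.inv(X.T @ V_inv @ X)
    P = V_inv - V_inv @ X @ S @ X.T @ V_inv
    Q_E = T @ P @ T                  # weighted residual sum of squares (Eq. 54)
    a = (1.0 / v).sum() - np.trace(S @ X.T @ np.diag(1.0 / v**2) @ X)  # Eq. 62
    return (Q_E - (k - p_plus_1)) / a                                  # Eq. 61
```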

Omnibus Tests

To test that the last q (q ≤ p) of the regression coefficients are simultaneously 0, that is, that βp−q+1 = βp−q+2 = . . . = βp = 0, use the test statistic

Q*R = β̂*q′(Σ*22)⁻¹β̂*q,    (65)

where β̂*q = (β̂*p−q+1, β̂*p−q+2, . . . , β̂*p)′ and (Σ*22)⁻¹ is the q × q matrix obtained by partitioning Σ* = [X′(V*)⁻¹X]⁻¹ as follows:

Σ* = [X′(V*)⁻¹X]⁻¹ = [ Σ*11   Σ*12
                       Σ*12′  Σ*22 ].

The statistic Q*R has the chi-square distribution with q degrees of freedom when the null hypothesis is true. When the null hypothesis is false, Q*R has the noncentral chi-square distribution with q degrees of freedom and noncentrality parameter

λ*R = βq′(Σ*22)⁻¹βq,    (66)

where βq = (βp−q+1, βp−q+2, . . . , βp)′. Thus, the power of the test based on Q*R at significance level α is

1 − H(cα | q; λ*R),    (67)

where H(x|ν; λ) is the cumulative distribution function of the noncentral chi-square with ν degrees of freedom and noncentrality parameter λ.

Conventions on heterogeneity. The power of tests for blocks of regression coefficients and individual coefficients depends on the conditional variances v1, . . . , vk and τ² through the matrix V*. If we use the conventions on heterogeneity discussed earlier and v is the typical or average sampling variance of effect sizes in the meta-analysis, then τ² = v/3 would be a small amount of heterogeneity, τ² = 2v/3 would be a moderate amount of heterogeneity, and τ² = v would be a large degree of heterogeneity. Thus, the matrix V* may be constructed for small, moderate, or large amounts of heterogeneity by adding v/3, 2v/3, or v, respectively, to each diagonal element of V.

Example of omnibus test for the mixed-effects regression model. Return to the example of the phonics instruction data given in Table 2, and suppose we wish to compute the power of the mixed-effects test that βq = 0 when the value of βq is actually βq = (0.25, 0.25, 0.00)′. To compute the noncentrality parameter λ*R, we must choose a value of τ² to compute V* and then Σ* and Σ*22 in turn. Suppose that we decided that there was a moderate degree of heterogeneity. The average of the vi in these data is about v = 0.10, so a moderate degree of heterogeneity would make τ² = 0.067. Inserting this value for τ² into Equation 63 yields the values of v*i for each study (the values of the ith diagonal element of V*). Using X and this value of V* to compute Σ* yields

Σ* = [  0.043  −0.025  −0.030  −0.004
       −0.025   0.036   0.010  −0.008
       −0.030   0.010   0.038  −0.005
       −0.004  −0.008  −0.005   0.036 ].

Then we compute λ*R using Equation 66 as λ*R = βq′(Σ*22)⁻¹βq with

Σ*22 = [  0.036   0.010  −0.008
          0.010   0.038  −0.005
         −0.008  −0.005   0.036 ],

which yields λ*R = 2.804. The 95% critical value for a chi-square distribution with q = 3 degrees of freedom is 7.815. Thus, the power to test whether βq is different from zero at the α = .05 significance level is given by Equation 67 as 1 − H(7.815|3; 2.804) = 1 − 0.742 = 0.258.

Note, however, that the actual data exhibit very large heterogeneity due to three outliers. Using the data in Table 2, we find a value of QE = 126.02 and a = 214.84; thus, τ̂² is computed as [126.02 − (19 − 3 − 1)]/214.84 = 0.52. Thus, the actual power would be much lower than computed given the (faulty) assumption of only moderate heterogeneity among studies. This example illustrates the importance of checking assumptions about heterogeneity whenever possible.
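The mixed-model omnibus power can be scripted by adding τ² to each conditional variance before forming Σ*. The sketch below uses our own naming; with the Table 2 design matrix, βq = (0.25, 0.25, 0.00)′, and τ² = 0.067, it should reproduce λ*R ≈ 2.80 and power ≈ 0.26.

```python
# Power of the mixed-model omnibus test, Equations 59, 66, and 67.
import numpy as np
from scipy.stats import chi2, ncx2

def mixed_omnibus_power(v, X, beta_q, tau2, alpha=0.05):
    """Power of the test that the last q mixed-model coefficients are 0."""
    v_star = np.asarray(v) + tau2                         # Equation 59
    Sigma_star = np.linalg.inv(X.T @ np.diag(1.0 / v_star) @ X)
    beta_q = np.asarray(beta_q)
    q = len(beta_q)
    Sigma22 = Sigma_star[-q:, -q:]                        # lower-right q x q block
    lam = beta_q @ np.linalg.inv(Sigma22) @ beta_q        # Equation 66
    return 1 - ncx2.cdf(chi2.ppf(1 - alpha, q), q, lam)   # Equation 67
```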

Tests of Individual Regression Coefficients

The test of the hypothesis that βj = 0 uses the test statistic Z*j = β̂*j/√σ*jj, where σ*jj is the variance (square of the standard error) of β̂*j, given by the jth diagonal element of the matrix Σ* = [X′(V*)⁻¹X]⁻¹. The statistic Z*j is normally distributed with mean βj/√σ*jj (which, of course, is 0 if the null hypothesis that βj = 0 is true) and variance 1. The one-sided test of the hypothesis that βj = 0 at level α rejects if

Z*j = β̂*j/√σ*jj > cα,

where cα is the 100(1 − α) percent critical value of the standard normal distribution. Therefore, the power of the one-sided test is given by

1 − Φ(cα − βj/√σ*jj),    (68)

where Φ(x) is the standard normal cumulative distribution function. Computation of power of the two-sided test is only slightly more complicated. Because the two-sided test of the hypothesis that βj = 0 at level α rejects when |Z*j| > cα/2, that is, if Z*j > cα/2 or if Z*j < −cα/2, the power of the two-sided test is given by

1 − Φ(cα/2 − βj/√σ*jj) + Φ(−cα/2 − βj/√σ*jj).    (69)

We now present an example of tests of individual regression coefficients. Return to the phonics program data in Table 2 and consider the power of the α = .05 level mixed-model test that the effect of grade is zero (i.e., βgrade = β1 = 0), when the actual value is β1 = 0.25 (indicating that the true difference in effect between kindergarten and first-grade children is about one fourth of a standard deviation). Suppose, as before, that there was a moderate degree of heterogeneity so that τ² = 0.067. Let us take α = .05, and the estimated variance of β̂*1 as 0.036, the second diagonal element of Σ*. The power of the one-sided test is given by 1 − Φ(cα − βj/√σ*jj) = 1 − Φ[1.645 − (0.25)/√0.036] = 1 − 0.62 = 0.38. For the two-sided test with α = .05, we have power equal to 1 − Φ(cα/2 − βj/√σ*jj) + Φ(−cα/2 − βj/√σ*jj) = 1 − Φ[1.96 − (0.25)/√0.036] + Φ[−1.96 − (0.25)/√0.036] = 0.26. As expected, the two-sided test has less power than a one-sided test, and neither of these tests has high enough power to merit recommendation of their use.
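Because Equations 68 and 69 differ from the fixed-effects case only in using σ*jj, the earlier coefficient_power sketch applies unchanged. For completeness, a self-contained check with σ*11 = 0.036 from this example:

```python
# Mixed-model z-test power for a single coefficient, Equations 68-69.
from math import sqrt
from scipy.stats import norm

z = 0.25 / sqrt(0.036)                         # beta_1 / sqrt(sigma*_11)
print(1 - norm.cdf(norm.ppf(0.95) - z))        # one-sided: ~0.37 (0.38 in the
                                               # text, which rounds Phi to 0.62)
c = norm.ppf(0.975)
print(1 - norm.cdf(c - z) + norm.cdf(-c - z))  # two-sided: ~0.26
```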

Tests of the Residual Variance Component

The test that the residual variance component τ² = 0 uses the same test statistic QE given in Equation 54 as does the test of goodness of fit of the fixed-effects regression model for effect sizes. As in the case of the test for the residual variance component in the analogue to ANOVA, the test statistic has the same null distribution in both the fixed-effects and random-effects models. However, the sampling distribution of QE is different in the mixed-effects model than in the fixed-effects model.

Under the mixed-model assumptions, as in the fixed-effects case, the test statistic QE has a chi-square distribution with (k − p − 1) degrees of freedom when the null hypothesis that τ² = 0 is true. When the null hypothesis of homogeneity is false, that is, when τ² > 0, and the conditional variances are all equal, that is, v1 = v2 = . . . = vk = v, then QE has a distribution that is (v + τ²)/v times a central chi-square distribution with (k − p − 1) degrees of freedom, so the power of the test that τ² = 0 is

1 − H[cαv/(v + τ²) | k − p − 1; 0],    (70)

where H(x|ν; 0) is the cumulative distribution of the central chi-square distribution with ν degrees of freedom. When the conditional variances are unequal, QE has a distribution of rather complex form (a weighted sum of chi-square distributions) that is not tabulated. However, an approximation to that distribution that is adequate for estimating statistical power is known (Satterthwaite, 1946). It approximates the distribution of QE by a gamma random variable with mean and variance equal to those of QE. The mean μQ under this model is

μQ = aτ² + (k − p − 1),    (71)

and the variance σQ² of QE is given by

σQ² = 2{tr[V⁻²(V*)²] − 2tr[MV⁻¹(V*)²] + tr[MV*MV*]},    (72)

where a is defined as in Equation 62 and M is a k × k matrix given by

M = V⁻¹X(X′V⁻¹X)⁻¹X′V⁻¹.    (73)

If we use the chi-square distribution with noninteger degrees of freedom to evaluate the gamma distribution, the power of the test that τ² = 0 is given by

1 − H(cα/r | s; 0),    (74)

where cα is the 100(1 − α) percentile point of the (central) chi-square distribution with (k − p − 1) degrees of freedom,

r = σQ²/(2μQ),    (75)

and

s = 2(μQ)²/σQ².    (76)

We now present an example of the test of the residual variance component. Returning to the phonics program data in Table 2, we now calculate the power of the α = .05 level test that the residual variance component is 0 using Satterthwaite's (1946) approximation. The 95% critical value of the central chi-square distribution with 15 degrees of freedom is 25.0. Recall from the example of the omnibus test that the value of the constant a in Equation 62 is a = 214.84. Assume, as before, that there is a moderate degree of heterogeneity (in this example that corresponds to τ² = 0.067). Using this value of τ² to create the V* matrix, we compute μQ = (214.84)(0.067) + (19 − 3 − 1) = 29.394 and σQ² = 128.431. We compute the constants r and s as r = σQ²/[2(μQ)] = 128.431/[2(29.394)] = 2.185 and s = 2(29.394)²/128.431 = 13.455. Using Equation 74, the power to detect heterogeneity of τ² = 0.067 is p = 1 − H(25.00/2.185 | 13.455; 0) = 1 − H(11.442 | 13.455; 0) = 1 − 0.39 = 0.61.
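For unequal conditional variances, the trace formula in Equation 72 replaces the simple 2b used in the ANOVA analogue. A sketch, again under our own naming and assuming v and X as before; with the Table 2 inputs and τ² = 0.067 it should reproduce σQ² ≈ 128.4 and power ≈ 0.61.

```python
# Satterthwaite power for the residual variance component, Eqs. 70-76.
import numpy as np
from scipy.stats import chi2

def residual_vc_power(v, X, tau2, alpha=0.05):
    """Power of the test that tau^2 = 0 in the mixed-model regression."""
    v = np.asarray(v)
    k, p_plus_1 = X.shape
    df = k - p_plus_1
    V_inv = np.diag(1.0 / v)
    V_star = np.diag(v + tau2)
    S = np.linalg.inv(X.T @ V_inv @ X)
    a = (1.0 / v).sum() - np.trace(S @ X.T @ np.diag(1.0 / v**2) @ X)  # Eq. 62
    M = V_inv @ X @ S @ X.T @ V_inv                                    # Eq. 73
    mu_q = a * tau2 + df                                               # Eq. 71
    sigma2_q = 2 * (np.trace(V_inv @ V_inv @ V_star @ V_star)
                    - 2 * np.trace(M @ V_inv @ V_star @ V_star)
                    + np.trace(M @ V_star @ M @ V_star))               # Eq. 72
    r = sigma2_q / (2 * mu_q)                                          # Eq. 75
    s = 2 * mu_q**2 / sigma2_q                                         # Eq. 76
    return 1 - chi2.cdf(chi2.ppf(1 - alpha, df) / r, s)                # Eq. 74
```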

Conclusions

Analysis of the power of the statistical tests to be conducted is an important part of planning any scientific research, including meta-analyses. Although the statistical power of tests in meta-analysis is often high, this need not always be so, as demonstrated by the examples in this article, all of which were taken from the published literature. As we demonstrate, the power of tests in fixed-effects analyses can be rather low. If there is substantial heterogeneity of effects, the statistical power of tests in mixed models can be even lower.

Power analyses are particularly important in aiding the interpretation of moderator analyses. Often these analyses are used to rule out the hypothesis that effect sizes are systematically associated with study design or population characteristics.


In such applications, it is crucial to know whether the moderator tests conducted would likely detect the effects if they were present. In the absence of adequate statistical power, such moderator tests may be misleading. Therefore, we advocate the computation of statistical power before conducting important moderator tests designed to rule out an association between effect size and study characteristics. If power is found to be low, we recommend either that the tests for moderator effects not be conducted at all or that the calculated power be explicitly reported to aid in the interpretation of results.

This article also demonstrates that the power of tests of goodness of fit in fixed-effects models or of residual variance components in mixed models is not always high. This suggests that it is unwise to rely on the results of tests for residual variance components in making analytic choices (such as whether to use fixed-effects or mixed-model procedures). However, interpretation of failures to reject the null hypothesis can be strengthened by demonstrating that the tests have high power to detect the smallest degree of heterogeneity that may be important.

References

Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Cronbach, L. J., & Snow, R. (1981). Aptitudes and instructional methods (2nd ed.). New York: Irvington.
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7, 177–188.
Ehri, L. C., Nunes, S., Stahl, S., & Willows, D. (2001). Systematic phonics instruction helps students learn to read: Evidence from the National Reading Panel's meta-analysis. Review of Educational Research, 71, 393–448.
Fleiss, J. L. (1994). Measures of effect size for categorical data. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 245–260). New York: Russell Sage Foundation.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3–8.
Harwell, M. (1997). An empirical study of Hedges's homogeneity test. Psychological Methods, 2, 219–231.
Hedges, L. V. (1982a). Fitting categorical models to effect sizes from a series of experiments. Journal of Educational Statistics, 7, 119–137.
Hedges, L. V. (1982b). Fitting continuous models to effect size data. Journal of Educational Statistics, 7, 245–270.
Hedges, L. V. (1992). Meta-analysis. Journal of Educational Statistics, 17, 279–296.
Hedges, L. V., & Olkin, I. (1985). Statistical models for meta-analysis. New York: Academic Press.
Hedges, L. V., & Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological Methods, 6, 203–217.
Hedges, L. V., & Vevea, J. L. (1996). Estimating effect size under publication bias: Small sample properties and robustness of a random effects selection model. Journal of Educational and Behavioral Statistics, 21, 299–333.
Kenny, D. A., Kashy, D. A., & Bolger, N. (1998). Data analysis in social psychology. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), Handbook of social psychology (pp. 233–265). New York: McGraw-Hill.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Raudenbush, S. W. (1994). Random effects models. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 301–321). New York: Russell Sage Foundation.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 231–244). New York: Russell Sage Foundation.
SAS Institute. (1990). SAS language: Reference [Software manual]. Cary, NC: Author.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110–114.
Schmidt, F. (1992). What do data really mean? Research findings, meta-analysis and cumulative knowledge in psychology. American Psychologist, 47, 1173–1181.
Schmidt, F., & Hunter, J. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529–540.
SPSS. (1999). SPSS Base 10.0 applications guide [Software manual]. Chicago: Author.
Thomas, L. (1997). Retrospective power analysis. Conservation Biology, 11, 276–280.
West, S. G., Aiken, L. S., & Krull, J. L. (1996). Experimental personality designs: Analyzing categorical by continuous variable interactions. Journal of Personality, 64, 1–48.

Received February 13, 2004
Revision received July 12, 2004
Accepted July 15, 2004
