Journal of Applied Psychology, 1988, Vol. 73, No. 4, 665-672
Copyright 1988 by the American Psychological Association, Inc. 0021-9010/88/$00.75
Validity Generalization and Situational Specificity: A Second Look at the 75% Rule and Fisher's z Transformation

Frank L. Schmidt
College of Business Administration, University of Iowa

John E. Hunter
Michigan State University

Nambury S. Raju
Illinois Institute of Technology
In this article we analyzed the James, Demaree, and Mulaik (1986) critique of validity generalization. We demonstrated that the James et al. article (a) is not relevant to the real-world use of validity generalization in organizations, (b) has overlooked the bulk of the evidence against the situational specificity hypothesis and, therefore, the substantive conclusion that the situational specificity hypothesis is "alive and well" cannot be supported, and (c) has confused the processes of hypothesis testing and parameter estimation in validity generalization and has made incorrect statements about the assumptions underlying both. In addition, (d) James et al.'s critique of the 75% rule is a statistical power argument and, as such, does not add to earlier statistical power studies; (e) the procedures for use of confidence intervals that they advocate are erroneous; (f) there is no double correction of artifacts in validity generalization, as they contend; (g) the bias in the correlation (r) and the sampling error formula for r that they discuss is well-known, trivial in magnitude, and has no empirical significance; and (h) the use of the Fisher's z transformation of r in validity generalization studies and other meta-analyses (which they advocate) creates an unnecessary inflationary bias in estimates of true validities and provides no benefits. In light of these facts, we conclude that the James et al. substantive conclusions and methodological recommendations are seriously flawed.
This article is an analysis of the James, Demaree, and Mulaik (1986) critique of validity generalization methods and conclusions, a long and detailed article. In the interests of brevity, we will focus only on those portions of James et al. that we judge to be most in need of critical evaluation. Their remaining arguments are left to the reader to evaluate in light of the analysis presented here.

Editor's Note. The introduction of new ideas into the literature is often followed by a polarization of opinion; some hail the contribution as a major advance and others denounce it as a false trail. Throughout my term as Editor of the Journal of Applied Psychology, I have found that polarization and the accompanying emotion seem to be especially strong when the contribution concerns data analysis. Not only authors but reviewers take positions that—if one is familiar with their earlier positions—are predictable in direction and strong in tone. In general, I have tried to avoid sequences of notes and rejoinders. These sequences are sometimes useful, of course. With regard to validity generalization (VG), I once suggested that a period of Hegelian dialectic was needed to sort out the best from the worst of VG arguments, equations, and procedures. Frank Schmidt, as is his wont, gently and correctly chided me, saying what was needed was fine-tuning, not debate. I agreed, but the fact of debate continues to characterize many discussions. The present article and the one by James, Demaree, Mulaik, and Mumford that follows it represent the continuing VG debate. For these articles and for subsequent exchanges like this one on other topics, I urge readers to follow Bacon's advice: "Read not to contradict and confute, nor to believe and take for granted, but to weigh and consider." With less wisdom, perhaps, but with more direct relevance, my paraphrase is: read not to be entertained or to take sides, but to broaden perspective.—RMG

Author Note. A longer, more detailed version of this article is available from the first author. The authors would like to thank Hannah Rothstein for her helpful comments on earlier versions of this article. Correspondence concerning this article should be addressed to Frank L. Schmidt, Department of Industrial Relations and Human Resources, Phillips Hall, University of Iowa, Iowa City, Iowa 52242.

No Implications for Use of Validity Generalization

The James et al. (1986) article has no implications for the actual use of validity generalization findings in applied settings. It focused solely on methods for testing the situational specificity hypothesis (Schmidt, Ocasio, Hillery, & Hunter, 1985), referred to by them in reverse as the cross-situational consistency hypothesis. Applied use of validity generalization (VG) findings does not depend on demonstrations of an absence of situational specificity; it requires only a showing that the bulk of the validities (e.g., 90%) lie in the positive range. Furthermore, even if the conclusion that a generalizably valid test shows some variation in validity across settings were accepted (i.e., if one concluded there was some cross-situational inconsistency), local validity studies would typically be unable to detect these differences. The statistical power of local studies is generally too low for the detection of even the presence of validity (Schmidt, Hunter, & Urry, 1976; Schmidt, Hunter, et al., 1985, Q&A 1); it is substantially lower for the detection of differences between settings in nonzero positive validities. James et al. overlook this fact.
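To make the applied decision rule concrete, the short Python sketch below computes the 90% credibility value from a VG analysis's estimates of the mean and standard deviation of true validity. The two input values are hypothetical and are offered only as an illustration; validity generalizes whenever the credibility value is positive, even when SDρ is not zero.

```python
# Illustrative sketch: the applied decision in validity generalization (VG)
# rests on the credibility value, not on SD_rho being exactly zero.
# The two input values below are hypothetical.

mean_true_validity = 0.45   # estimated mean of the true-validity distribution
sd_true_validity = 0.10     # estimated SD of true validities (SD_rho)

# 90% credibility value: the point above which 90% of true validities fall,
# assuming an approximately normal distribution of true validities.
credibility_90 = mean_true_validity - 1.28 * sd_true_validity

print(f"90% credibility value: {credibility_90:.2f}")
print("Validity generalizes" if credibility_90 > 0 else "Generalization not supported")
```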
Most Evidence Against Situational Specificity Overlooked

Even considered solely as an analysis or critique of evidence bearing on the situational specificity hypothesis and methods for testing it, the article has deficiencies. Perhaps the most serious is that James et al. (1986) ignored most of the published evidence against the situational specificity hypothesis for cognitive ability tests. The authors assumed that there is only one method for testing this hypothesis—use of the 75% rule in individual meta-analyses—and, therefore, only one line of evidence against the hypothesis: the results of such tests. However, hypotheses in science are virtually never accepted or rejected on the basis of only one kind of evidence. Rather, such issues are decided by the total pattern or configuration of available evidence. In the case of situational specificity for measures of cognitive abilities, there are three different converging lines of evidence—none dependent on the 75% rule—indicating that true situational variance is zero. These lines of evidence are described in Schmidt, Hunter, et al. (1985, Q&A 27).

Implicit in the James et al. article is the assumption that the situational specificity hypothesis is to be evaluated in isolation separately for each VG analysis. Just as earlier researchers have focused on the individual validity study, failing to see that single studies cannot be interpreted in isolation, James et al. focused on the single VG study. They assumed that a judgment should be made about the existence of situational specificity for each VG analysis, independently of other VG studies. They did not recognize that it is the higher level pattern of findings across numerous VG analyses that is important in revealing the underlying reality. James et al. did not consider or challenge the multiple converging lines of evidence against situational specificity; rather they omitted any mention of this evidence. The result is their unjustified conclusion that "the hypothesis of situational specificity is alive and well" (James et al., 1986, p. 445). In light of the multiple converging lines of evidence, the 75% rule has become essentially irrelevant. The question of situational specificity has been decided by the total cumulative evidence. The same cumulative process can be expected to take place in areas of investigation other than the validity of cognitive tests as evidence accumulates (although at present we do not know the final outcome in such areas).

Confusion of Hypothesis Testing and Parameter Estimation

As was pointed out by Callender (personal communication, May 1, 1986), the James et al. (1986) article failed to distinguish between hypothesis testing (i.e., use of the 75% rule, the chi-square test, or computer sampling simulation to choose between the hypotheses of constant vs. variable true validity) and parameter estimation (i.e., estimation of the mean and standard deviation of true validity). In hypothesis testing, one compares parameter estimation results (in particular, the estimate of SDρ) with the value expected given the null hypothesis—not the assumption—that ρ is constant across studies. That is, the null hypothesis that ρ is constant is tested. In parameter estimation and, in particular, in estimating SDρ, there is no hypothesis that ρ is constant across studies, nor is there any hypothesis or assumption that ρ is variable across studies. In parameter estimation, all VG equations (Callender & Osburn, 1980; Raju & Burke, 1983; Schmidt, Hunter, Pearlman, & Shane, 1979) treat ρ as a random variable that is free to take on any distribution of values. Callender and Osburn (1980, pp. 556-558) provided an explicit explanation and mathematical demonstration of these properties of validity generalization.

James et al. treated validity generalization as if the hypothesis testing component were the only component, and then mistakenly implied that both the hypothesis testing and the estimation of the true validity mean and standard deviation are based on the assumption that true validity is constant. For example, James et al. (1986, p. 441) stated, "The key 'what if' assumption [italics added] for the validity generalization procedure is: What if the population correlation is assumed [italics added] to be the same over studies (cf. Callender & Osburn, 1980; Hunter, Schmidt, & Jackson, 1982)?" Neither hypothesis testing nor parameter estimation is based on an assumption that ρ is constant. Hypothesis testing tests the hypothesis that ρ is constant. In parameter estimation, there is not even the hypothesis that ρ is constant.
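The distinction can be made concrete with a bare-bones calculation. In the Python sketch below, the validities and sample sizes are hypothetical, and only sampling error is corrected for (full VG procedures also correct for range restriction and measurement error). The estimate of SDρ is obtained by subtraction, with no assumption about the form of the ρ distribution; the 75% decision rule is then applied as a separate step.

```python
import numpy as np

def vg_summary(rs, ns):
    """Bare-bones meta-analysis of observed validities, correcting for
    sampling error only (a sketch, not a full VG procedure)."""
    rs, ns = np.asarray(rs, float), np.asarray(ns, float)
    r_bar = np.average(rs, weights=ns)                   # mean observed validity
    var_obs = np.average((rs - r_bar) ** 2, weights=ns)  # observed variance
    var_e = np.mean((1 - r_bar ** 2) ** 2 / (ns - 1))    # expected sampling error variance
    # Parameter estimation: Var(rho) is estimated by subtraction; rho is treated
    # as a random variable free to take on any distribution of values.
    var_rho = max(var_obs - var_e, 0.0)
    # Hypothesis testing: the 75% rule compares the same two quantities with the
    # value expected under the null hypothesis that rho is constant.
    pct_explained = 100 * var_e / var_obs
    return r_bar, var_rho ** 0.5, pct_explained, pct_explained >= 75

# Hypothetical observed validities and sample sizes.
r_bar, sd_rho, pct, no_specificity = vg_summary(
    rs=[0.28, 0.41, 0.33, 0.52, 0.19, 0.45],
    ns=[68, 120, 75, 90, 60, 110])
print(f"mean r = {r_bar:.3f}, SD_rho = {sd_rho:.3f}, "
      f"{pct:.0f}% of variance explained, no specificity inferred: {no_specificity}")
```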
The 75% Rule and Statistical Power

The major criticism by James et al. (1986) of the 75% rule is that it is "too powerful" for rejecting situational specificity. This is the same as saying that it has low statistical power to detect moderators. Alternative decision rules, based on statistical tests, are available for use. Computer simulation studies show that these alternative decision rules (e.g., the chi-square test) do not have generally higher statistical power than the 75% rule (Osburn, Callender, Greener, & Ashworth, 1983; Sackett, Harris, & Orr, 1986). Under some circumstances, the 75% rule has higher power; under others, lower power. After their critique of the 75% rule, James et al. offered no alternative decision rule or test that they advocate as superior. Their comments do not appear to add to previous studies of statistical power in validity generalization and meta-analysis (Osburn et al., 1983; Sackett et al., 1986).
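The power question can also be examined directly by simulation. The following Monte Carlo sketch is a toy design in the general spirit of the published power studies, not a reproduction of them; the choices of k = 16 studies, N = 68 per study, and the two hypothetical ρ distributions are arbitrary and are used only for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

def detection_rates(rho_values, k=16, n=68, reps=2000):
    """Toy power comparison of the 75% rule and a chi-square (Q) homogeneity
    test for detecting variation in rho across k studies of size n."""
    hits_75 = hits_q = 0
    q_crit = chi2.ppf(0.95, k - 1)
    for _ in range(reps):
        rhos = rng.choice(np.asarray(rho_values, float), size=k)
        zs = rng.normal(np.arctanh(rhos), 1 / np.sqrt(n - 3))  # Fisher z approximation
        rs = np.tanh(zs)
        r_bar, var_obs = rs.mean(), rs.var()
        var_e = (1 - r_bar ** 2) ** 2 / (n - 1)      # expected sampling error variance
        hits_75 += (100 * var_e / var_obs) < 75      # 75% rule flags a moderator
        q = ((n - 3) * (zs - zs.mean()) ** 2).sum()  # Q statistic on the z values
        hits_q += q > q_crit                         # chi-square test flags a moderator
    return hits_75 / reps, hits_q / reps

# True moderator: rho is .30 in half the settings and .50 in the other half.
print("moderator present:", detection_rates([0.30, 0.50]))
# No moderator: rho constant at .40 (these rates are the false-positive rates).
print("no moderator:     ", detection_rates([0.40]))
```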
On the basis of multiple lines of evidence (Schmidt, Hunter, et al., 1985, Q&A 27), validity generalization studies have concluded that there is no situational specificity for cognitive ability tests. James et al. (1986, p. 441) criticized this conclusion on grounds that it implies "irrefutable evidence" against situational specificity and "leave[s] little room for the possibility that future tests based on different assumptions might disconfirm cross-situational consistency." In this connection, they advanced two propositions. First, the empirical analyses that are the basis for the conclusion "usually involve untested assumptions that may be incorrect" (p. 441). In their analysis of "untested assumptions," James et al. focused only on the 75% decision rule, ignoring the other kinds of evidence against situational specificity. Thus, even if their critique of untested assumptions were correct, it would not demonstrate that our conclusion about situational specificity is inappropriate. Second, James et al. stated that "a set of observed data may be explained equally well by more than one causal theory" (p. 441). If this sentence means that alternate causal theories can always be found that provide equally good explanations, it is incorrect. If it means that in some cases alternate causal theories provide equally good explanations, it is true as a general principle but is irrelevant here, because situational specificity is not such a case. In our judgment, the combined effect of the different lines of evidence against situational specificity is to rule out "alternative explanatory models." That is, the alternative explanatory model of situational specificity is incapable of explaining the total pattern of cumulative evidence; only the theory that there is no situational specificity fits all the different kinds of data (Schmidt, Hunter, et al., 1985, Q&A 27). Thus, the wording of our conclusion appropriately "leave[s] little room for the possibility that future tests . . . might disconfirm cross-situational consistency" (James et al., 1986, p. 441). The total pattern of available evidence indicates that there is little such room. Of course, there is no such thing in science as irrefutable evidence; therefore, it can never be required as the basis of a scientific conclusion. The phrase irrefutable evidence appears to be a straw man. A strong scientific conclusion never implies irrefutable evidence. It implies only that the overwhelming preponderance of evidence favors one hypothesis and is contrary to competing hypotheses—the case here. Under such circumstances, it is scientific practice to accept provisionally the hypothesis favored by the evidence. Scientists do not, as implied by James et al., conclude that nothing can be concluded (see also Schmidt, Hunter, et al., 1985, Q&A 32). Furthermore, in light of the multiple forms of evidence against the situational specificity hypothesis, it is erroneous to state that a strong conclusion represents the "error of affirming the consequent" (James et al., 1986, p. 442). By this logic, all conclusions in any field of science must be labeled errors of affirming the consequent. That is, if one holds that there are always alternate explanatory models that fit the evidence equally well, then one would never be justified in concluding in favor of any particular explanation or theory. This means that no conclusions or explanations would ever be possible in science—a position of epistemological nihilism.

James et al. (1986) presented and analyzed a set of hypothetical observed validity coefficients (their Table 1) that are consistent with both the conclusion that there is no situational specificity and the conclusion that there is situational specificity. This analysis is based on a hypothetical data set specifically designed to produce this particular outcome. The mathematically preferred method for addressing a statistical question is by analytical means, that is, use of mathematical derivation. If this is not possible, then one may use simulation methods (Monte Carlo; e.g., Callender, Osburn, & Greener, 1979; Raju & Burke, 1983) based on known statistical principles. Far less credibility can be given to designed data sets, such as that of James et al. Such data sets may be inconsistent with statistical principles; that is, they can easily be statistical flukes. For example, the James et al. designed data set is strangely platykurtic (flat) and shows very little indication of central tendency. However, even if this were a real rather than artificial data set, the consistency of a small data set such as this with the situational specificity hypothesis would have little evidential value against the total pattern of evidence against situational specificity cumulated across many data sets.

Their Confidence Interval Procedures Are Erroneous

James et al. (1986) concluded that their artificial data set is consistent with a range of ρi (population correlations) from .04 to .84. In reaching this conclusion, they started with the assumption that each observed validity might represent a different underlying value of ρi. They then placed confidence intervals around the largest and smallest observed validities; the endpoints of their combined confidence interval for all values of ρi were then the lower bound of the confidence interval around the smallest observed validity and the upper bound of their confidence interval around the largest observed validity. They presented this as a legitimate method of data analysis when the researcher hypothesizes that the ρi vary (i.e., hypothesizes the possible presence of situational specificity). The inconsistency in their own numbers for the hypothetical data in their Table 1 is revealing. According to their confidence interval method, the 95% range of ρ should be .04 to .84, a band of width .80. However, they also presented a meta-analysis of Fisher's z values for these data. Their meta-analysis (p. 448) indicates that population zs have a mean of .56 and a standard deviation of .08 (i.e., [.021 - .0149]^1/2). The corresponding 95% range for z is .40 to .72. If these values are transformed to r, then the 95% range for population correlations is .38 to .58, a band of width .20. That is, the James et al. confidence interval argument suggests a range of .04 to .84, whereas the James et al. meta-analysis suggests a range of .38 to .58. The band width is .80 for the confidence interval argument and .20 (only 25% as wide) for their meta-analysis. At least one of their two analyses must be erroneous. The following discussion shows that the error is in their confidence interval argument.

In estimating the variation of population correlations, it is clear that it would be a naive error to take the variation in observed correlations at face value, that is, to use the standard deviation of observed correlations as the estimate of the standard deviation of population correlations. Every method of meta-analysis—including the Fisher z method that we originally used and that James et al. proposed to return to—recognizes that the standard deviation of observed correlations is generally larger than the standard deviation of population correlations. Yet the James et al. (1986) method of confidence intervals is even more inaccurate than is the naive use of the face-value standard deviation. In their hypothetical example, meta-analysis of correlations indicates that the 95% range for the ρi is .39 to .61, a width of .22. The naive 95% range estimate, based on the observed standard deviation of .105, would on average be .29 to .71, a width of .42, which is almost twice as wide as the meta-analysis range. The confidence interval method recommended by James et al. produces on average a range of .04 to .84, a width of .80, which is more than three times the meta-analysis range and almost twice the range yielded by the naive face-value method. This large error in their hypothetical example is no accident. The error will always be large unless sample sizes are so large that sampling error is trivial. The James et al. method makes no correction for sampling error; rather it multiplies the effect of sampling error.
Homogeneous Studies: An Example

The nature of the error made by James et al. can be most easily seen in the case in which all population correlations are actually equal (homogeneous studies).1 Consider 30 studies conducted in different settings in which, unknown to the researcher, the population correlation is constant at .50. Because all population correlations are the same, the actual 95% range in population correlations is .50 to .50, with a width of zero. Assume that all studies have a sample size of 70. If 30 numbers are drawn randomly from a normal distribution, on average (i.e., in the median sample) the largest value will be 2.00 SDs above the mean and the smallest value will be 2.00 SDs below the mean. In this example, the range of correlations would on average be .32 to .68, with a band width of .36. Thus, a naive meta-analyst would be led to believe that the range of population correlations is .32 to .68 instead of .50 to .50, a massive error. On average, the largest correlation in our example is .68, which would have a 95% confidence interval of .53 to .79. Thus, .79 will be the largest of the upper bounds. The smallest correlation will on average be .32, which would have a confidence interval of .09 to .52. Thus, .09 will be the smallest of the lower bounds. The James et al. estimate of the 95% range of population correlations is thus .09 to .79, with a width of .70.

In summary, the actual range of population correlations in this example is .50 to .50, a band width of zero. The observed sample correlations will on average have a 95% range of .32 to .68, a band of width .36. If this range is naively accepted as the estimate of the range of population correlations, the result is a very large error in estimation. The James et al. approach to confidence intervals produces an estimated range of .09 to .79, a band of width .70. Thus, the James et al. "sophisticated" approach is even more inaccurate than the naive estimate, almost twice as inaccurate.

The optimal method of estimating the common population correlation is meta-analysis. If the meta-analysis shows the studies to be homogeneous, then the best single estimate of the population correlation is the average observed correlation across studies. The sampling error in that mean correlation is the second-order sampling error2 stemming from the fact that the number of studies is not infinite (Schmidt, Hunter, et al., 1985, Q&A 25). For homogeneous studies, the formula for the standard error in the mean correlation is

SE(r̄) = (1 - r̄²) / (N - K)^1/2,

where N is the total sample size across the K studies.
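The figures in this example follow from the normal approximation and can be reproduced in a few lines of code. The Python sketch below uses the conventional Fisher z interval for the individual correlations; it recovers the expected observed range of roughly .32 to .68 and the .09 to .79 range produced by placing confidence intervals around the two extreme values.

```python
import numpy as np

# Sketch of the homogeneous example: 30 studies of N = 70 each, with the
# population correlation fixed at rho = .50. Normal-theory approximations
# (and the Fisher z interval for single rs) are used throughout.
rho, k, n = 0.50, 30, 70

sd_r = (1 - rho ** 2) / np.sqrt(n - 1)        # SD of observed rs from sampling error alone
lo_r, hi_r = rho - 2 * sd_r, rho + 2 * sd_r   # expected extreme rs among 30 draws (about +/- 2 SD)
print(f"expected observed range: {lo_r:.2f} to {hi_r:.2f}")   # ~ .32 to .68

def fisher_ci(r, n):
    """95% confidence interval for a single r via the Fisher z transformation."""
    z, half = np.arctanh(r), 1.96 / np.sqrt(n - 3)
    return np.tanh(z - half), np.tanh(z + half)

# Endpoint method criticized in the text: lower bound of the CI around the
# smallest r, upper bound of the CI around the largest r.
print(f"endpoint range: {fisher_ci(lo_r, n)[0]:.2f} to {fisher_ci(hi_r, n)[1]:.2f}")  # ~ .09 to .79

# The true range of population correlations is .50 to .50 (width zero); the
# standard error of the mean r across the 30 studies is far smaller than sd_r.
print(f"SE of mean r: {(1 - rho ** 2) / np.sqrt(k * n - k):.3f}")
```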
Heterogeneous Case: An Example

In the James et al. (1986) hypothetical example, the meta-analysis estimate of the standard deviation of population correlations is .054. If population correlations have a true standard deviation of .054 (rather than zero), then there will be two components to the variation in sample correlations, sampling error and real variation. The result is an addition of variances; the variance of observed correlations will be the sum of the variance in population correlations plus the sampling error variance. In the James et al. hypothetical example, the observed variance would on average be .008152 + .002874 = .011026, resulting in a standard deviation of .105. Across a set of 30 studies, the high and low sample correlations would on average vary by 2 SDs from the mean. Thus, the range of observed sample correlations would be on average .29 to .71, a band of width .42. This band is much wider than the 95% range of .22: .42/.22 = 1.91, or almost twice as wide. The James et al. method produces a confidence interval of .07 to .83, a band of width .76. The meta-analysis estimate of the range of population correlations in their example is only .39 to .61, or .22. The James et al. estimated confidence interval is too large by a factor of .76/.22 = 3.45. That is, their estimate is more than three times larger than the best estimate of the range. Furthermore, the James et al. method of estimating confidence intervals leads to an error that is even larger than the error made by a naive observer who ignores sampling error and relies on the standard deviation of the observed correlations. The James et al. error is larger by a factor of (.76 - .22)/(.42 - .22) = 2.70. That is, the James et al. error is 2.7 times larger than the naive error.

The best estimate of the study population correlation and the most accurate estimate of its confidence interval is yielded by a Bayesian analysis using the mean and standard deviation of population correlations as determined from the meta-analysis of the studies. The best estimate of the mean population (uncorrected) correlation is the average correlation (with a small adjustment if the average sample size is less than 20; Hunter, Schmidt, & Coggin, 1986). The sampling error in the average correlation is larger for heterogeneous studies than for homogeneous studies because there is an additional term produced by variance in the population correlation. The sampling error variance in the average correlation is

Var(r̄) = (1 - r̄²)² / (N - K) + σ²ρ / K,

where N is total sample size, K is the number of studies, and σ²ρ is the variance of population correlations. Conventional confidence intervals apply only to isolated correlations. A proper confidence interval for individual study correlations in a meta-analysis can be obtained only by using the meta-analytic results in a Bayesian equation (Hedges, 1988; Raudenbush & Bryk, 1985).

1 In any given set of observed validities, there is situational specificity or there is not situational specificity. One or the other of these must be the actual state of nature. In a confusing passage, James, Demaree, and Mulaik (1986) make an erroneous statement to the contrary: "Fortunately, situational specificity and cross-situational consistency as described in this article are but two of many possible views" (p. 449). They then refer to a "middle ground between these two extremes," which is never defined. Logically, situational variance must be greater than zero or it must be zero. There are no other alternatives.

2 The comments in Footnote 1 of James, Demaree, and Mulaik (1986, p. 449) revealed that they believe that if predicted artifactual variance is greater than observed variance, the outcome can be due only to erroneous methods or erroneous calculations. This belief ignores second-order sampling error (see Schmidt, Hunter, et al., 1985, Q&A 25, and Callender & Osburn, 1988).
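As an illustration of the Bayesian point, the following sketch computes an empirical Bayes interval for a single study correlation under the normal approximation, using the meta-analytic mean and SDρ from the heterogeneous example above as the prior. The observed correlation of .68 used as input is hypothetical.

```python
import numpy as np

# Schematic empirical Bayes interval for one study's correlation, with the
# meta-analytic mean and SD of population correlations serving as the prior.
# Values follow the heterogeneous example (mean r of about .50, SD_rho = .054,
# N = 70 per study); the observed r of .68 is a hypothetical input.
rho_bar, sd_rho, n = 0.50, 0.054, 70
r_obs = 0.68

v = (1 - rho_bar ** 2) ** 2 / (n - 1)   # sampling error variance of one r (about .008152)
tau2 = sd_rho ** 2                      # variance of population correlations

# Normal-theory posterior: a precision-weighted average of the observed r
# and the meta-analytic mean, with correspondingly reduced uncertainty.
post_var = 1 / (1 / v + 1 / tau2)
post_mean = post_var * (r_obs / v + rho_bar / tau2)
half = 1.96 * np.sqrt(post_var)
print(f"posterior mean = {post_mean:.3f}, "
      f"95% interval = {post_mean - half:.3f} to {post_mean + half:.3f}")
# A conventional interval around r_obs alone ignores the meta-analytic
# information and is therefore much wider and centered too far from rho_bar.
```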
There Are No Double Corrections

James et al. (1986) stated that it is not reasonable to attribute up to 25% of observed validity variance to artifacts for which corrections cannot be made. They concluded that the requirement that 75% of the variance be explained should be raised to some higher value (e.g., 90%). However, they declined to specify a precise critical percentage because "we need to allow for the cumulation of research that indicates the proportion of S²r typically accounted for in VG analysis" (p. 445). However, such cumulative knowledge is currently available (Schmidt, Hunter, et al., 1985, Q&A 27). James et al. found it "incongruous" that the variance attributed to artifacts not corrected for could be more than three times greater than that attributed to differences between studies in range restriction, criterion reliability, and test reliability. However, they did not consider the most important of the uncorrected artifacts—computational, transcriptional, programming, and other data errors. (All of these they apparently included under "clerical errors.") Errors of these kinds are much more frequent than we would like to believe (Tukey, 1960; Wolins, 1962). Tukey maintained that all real data sets contain errors. Consider also the following statement by Gulliksen (1986):

I believe that it is essential to check the data for errors before running my computations. I always wrote an error-checking program and ran the data through it before computing. I find it very interesting that in every set of data I have run, either for myself or someone else, there have always been errors, necessitating going back to the questionnaires and repunching some cards, or perhaps discarding some subjects. (p. 4)

Furthermore, such errors are likely to result in outliers, and outliers have a dramatic inflationary effect on the variance. In a normal distribution, for example, the standard deviation is 65% determined by the highest and lowest 5% of data values (Tukey, 1960). In our judgment, data errors of various kinds could alone account for 25% of the observed variance in sets of observed validity coefficients, because the observed variance is itself a very small number. Thus, although it is probably true that the slight differences between predictors in factor structure account for little validity variance—and this may also be true for study differences in criterion contamination and deficiency—in our judgment the 75% rule can be justified on the basis of data errors alone.

James et al. (1986) alleged that there may be double correction for artifactual variance in VG procedures. They hypothesized that some of the effects of clerical errors and differences between studies in criterion deficiency and contamination are captured and eliminated by the corrections for variations in predictor and criterion reliability. In this connection, they stated that "criterion contamination involves . . . random error" (p. 444). Random error is not a component of criterion contamination (Brogden & Taylor, 1950); therefore, there is no double correction with respect to random error. They next stated that "variation in systematic biases over studies also influences variations in CR [criterion reliability]" (James et al., 1986, p. 444). This statement is irrelevant because such systematic biases—essentially variations in leniency or stringency—are not a form of criterion contamination, inasmuch as they do not change the relative scaling of individuals on the criterion (Brogden & Taylor, 1950). Criterion contamination has a very specific statistical definition; it does not include every flaw or imperfection in a criterion measure. Finally, James et al. stated that clerical errors involved in criterion or predictor measurement "should influence variation in validities via variation in CR and PR, respectively" (p. 444). Clerical errors in recording, transcribing, or entering predictor or criterion data would always be expected to affect observed validities. However, only in rare cases would a study carry through reliability computations in such a way as to produce an exactly compensating decrease in the reliability estimate. For example, reliability estimates are often calculated on only a subset of the study subjects because ratings by two different supervisors cannot be obtained for all subjects. In any event, variations in criterion reliability that are due to such clerical errors have not been included in the reliability distributions used in the majority of VG studies, and thus the effects of such clerical errors have not been removed through the corrections for variation in criterion reliability across studies.

Bias in r and in Fisher's z

James et al. (1986) next turned to the long-known fact that there is a slight downward bias in r as an estimate of ρ, the population correlation. As a result of this tiny bias, r and e are slightly correlated, leading to a very slight underestimation of the sampling error variance in r and in sets of rs (a minuscule conservative bias). The bias in r decreases as sample size increases; for Ns larger than 40, the bias is always less than .005; that is, it is less than rounding error and therefore disappears when r is rounded to the usual two places. When ρ varies across studies, there is a very tiny negative covariance (and hence correlation) between ρ and e. James et al. are concerned that this could lead to underestimation of SDρ. This point was developed by Hedges (1982) specifically in terms of its implications for validity generalization research. James et al. cited Linn and Dunbar (1982), who analyzed Hedges' specific critique in some detail and concluded that "while Hedges' results are of theoretical interest, it is unlikely that the missing covariance term causes any serious difficulty in practice" (p. 15). Linn and Dunbar also stated that the tiny departure from complete independence "is probably of no practical concern" (p. 15).

The trivial nature of the effect of the tiny negative correlation between ρ and e can be illustrated by the analysis of large samples of empirical data conducted by the third author. The data base consisted of 84,808 high school graduates tested with the Armed Services Vocational Aptitude Battery (ASVAB; U.S. Department of Defense, 1984). The sum of scores on the Math Knowledge and Word Knowledge subtests was the predictor score; the criterion score was final grade in training. A total of 10 different, mutually exclusive populations were generated with approximately 8,000 subjects each; they were formed to have different degrees of range restriction on the predictor and hence different predictor-criterion correlations. A description of the procedures used to create the 10 populations is given in Raju, Sussman-Pappas, and Parker (1986). To assess the correlation between the sampling errors and population correlations
across the 10 populations, 100 different samples of 70 subjects each were drawn at random from each of the 10 populations. For each sample, a correlation coefficient (ri) and a sampling error (ei = ri - ρi) were obtained. The ei and ρi were then correlated across the 1,000 observations. Each ρi was also expressed in z form, and the corresponding correlation between the population Fisher's z values and the Fisher's z sampling errors was calculated. The correlation between sampling errors and population correlations was .0140 for r and .0167 for Fisher's z. These correlations are very small and close to zero, as expected from analytical considerations. Thus, one would expect little effect on the accuracy of estimates of sampling error variance. Also, these correlations do not support the claim of superiority for the z coefficients advanced by James et al. (1986). The present empirical study should be replicated with larger samples. (Even with N = 1,000, the sampling standard error [.032] is large relative to the population correlation [which is approximately -.02 for r based on analytic formulas]. But the observed value of r [.0140] is not significantly different from the analytically expected value.) However, the general finding concerning the trivial magnitude of the correlation between sampling errors and population correlations for the r and z coefficients is not likely to be very different from the one observed here. Callender and Osburn (1988) presented additional evidence, based on extensive simulation studies, that the standard formula for the sampling error variance of the correlation is quite accurate, even when the ρi vary widely.

James et al. (1986) pointed out that our early validity generalization research was conducted using Fisher's z, stating that we dropped Fisher's z in favor of using correlations "under the assumption [italics added] that the formula for sampling error based on correlations was 'very accurate' (Schmidt et al., 1980, p. 660)" (p. 448). On the same page that they cited here, we stated: "However, Callender, Osburn, and Greener [1979] (Note 7) have shown in simulation studies that the formula for the sampling error of the correlation coefficient that we now use in both procedures is very accurate" (Schmidt, Gast-Rosenberg, & Hunter, 1980, p. 660). Thus, the change in procedures was based on evidence, not on an assumption.

James et al. (1986) challenged the equation