Journal of Applied Psychology 1977, Vol. 62, No. 5, 529-540
Development of a General Solution to the Problem of Validity Generalization

Frank L. Schmidt
U.S. Civil Service Commission and George Washington University

John E. Hunter
Michigan State University
Personnel psychologists have traditionally believed that employment test validities are situation specific. This study presents a Bayesian statistical model which allows one to explore the alternate hypothesis that variation in validity outcomes from study to study for similar jobs and tests is artifactual in nature. Certain outcomes using this model permit validity generalization to new settings without carrying out a validation study of any kind. Where such generalization is not justified, the procedure provides an improved method of data analysis and decision making for the necessary situational validity study. Application to four distributions of empirical validity coefficients demonstrated the power of the model.

A recent study (Schmidt, Hunter, & Urry, 1976) addressed the belief, dominant in personnel psychology, that meaningful empirical validation studies are possible for most, if not all, jobs in most organizations. This study showed that, because of range restriction and less than perfect criterion reliability, the sample sizes necessary to provide adequate statistical power are usually much larger than has typically been assumed. This finding leads to the conclusion that empirical validity studies are "technically feasible" much less frequently than the profession has assumed. The present study is addressed to another of the orthodoxies of personnel psychology: the belief that test validity is generally highly situation specific. Considerable variability from study to study is observed in raw validation results even when jobs and tests appear to be similar or essentially identical (Ghiselli, 1966). The explanation usually advanced for this phenomenon is that the factor structure of job performance is different from job to job and that the human observer or job analyst is simply too poor an information receiver and processor to detect these subtle but important differences.
An earlier and shorter version of this article was given the 1976 James McKeen Cattell Research Design Award by Division 14 of APA. Requests for reprints should be sent to Frank L. Schmidt, Personnel Research and Development Center, U.S. Civil Service Commission, 1900 E Street, N.W., Washington, D.C. 20415.
Therefore, it is concluded, empirical validation is required in each situation and validity generalization is essentially impossible (Albright, Glennon, & Smith, 1963, p. 18; Ghiselli, 1966, p. 28; Guion, 1965, p. 126). This harsh "fact" is widely lamented, and it is said that our inability to solve the problem of validity generalization is perhaps the most serious shortcoming in selection psychology today (Guion, 1976; APA, Division of Industrial-Organizational Psychology, Note 1). The inability to generalize validities precludes development of the general principles of selection that could take the field beyond a mere technology to the status of a science (Guion, 1976). But there is evidence suggesting that much of the variance in the outcomes of validity studies within job-test combinations may be due to statistical artifacts. Schmidt, Hunter, and Urry (1976) have shown that under typical and realistic validation conditions, a valid test will show a statistically significant validity in only about 50% of studies. As one specific example, they showed that when true validity for a given test is in fact constant at .45 in a series of jobs, criterion reliability is .70, the prior selection ratio on the test is .60, and sample size is 68 (the median over 406 published validity studies; Lent, Aurbach, & Levin, 1971), the test will be reported to be valid 54% of the time and invalid 46% of the time (p < .05, two-tailed test).
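The arithmetic behind this example can be made explicit. The following Python fragment is a hedged sketch rather than the authors' program: it attenuates the true validity for criterion unreliability, shrinks it for range restriction using the selection-ratio-to-standard-deviation mapping assumed later in Table 2 (a selection ratio of .60 corresponds to u = 6.49/10 = .649), and then computes two-tailed power from the normal approximation in Fisher's z.

```python
# Rough reproduction of the power example above (a sketch, not the authors' program).
# Assumptions: true validity .45, criterion reliability .70, range restriction at a
# prior selection ratio of .60 (u = .649, from Table 2), N = 68, p < .05 two-tailed.
from math import atanh, sqrt
from statistics import NormalDist

def restrict(R, u):
    """Thorndike Case II solved for the restricted correlation."""
    return u * R / sqrt(1.0 - R**2 + (u * R)**2)

true_validity = 0.45
criterion_reliability = 0.70
u = 0.649                      # restricted SD / unrestricted SD at selection ratio .60
n = 68

r_attenuated = true_validity * sqrt(criterion_reliability)   # attenuate for criterion error
r_observed = restrict(r_attenuated, u)                        # shrink for range restriction

z = atanh(r_observed)                      # expected observed validity in Fisher's z
se = 1.0 / sqrt(n - 3)
z_crit = NormalDist().inv_cdf(0.975) * se  # two-tailed significance cutoff in z units
power = (1.0 - NormalDist(z, se).cdf(z_crit)) + NormalDist(z, se).cdf(-z_crit)
print(round(r_observed, 2), round(power, 2))   # about .26 and roughly .54-.56
```

Under these assumptions the expected observed correlation is about .26, and the computed power falls near the 54% figure cited above.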
These are the kinds of results that are in fact observed in the literature (Ghiselli, 1966). For an empirical example, the reader is referred to Bender and Loveless (1958), who report a series of validity studies across time involving the same job and tests; their results are explained beautifully by the present theory. When sample size is adequate to provide appropriate levels of statistical power, the observed results are quite different. In a well-executed large-sample series of studies, it was found that when Army occupations were classified rationally into job families, tests showed essentially identical validities and regression weights for all jobs within a given family (Brogden, Note 2). Further, new jobs assigned rationally to job families also fit this pattern. Finally, these validities have held constant since the end of World War II, when they were determined. Brogden has concluded that when methodological artifacts are controlled and large samples are used (here hundreds and often thousands), obtained validities are in fact quite stable and similar across time and situations for similar jobs. A third piece of evidence comes from a study by Schmidt, Berner, and Hunter (1973) in a closely related area that dramatically illustrates the extent to which selection psychologists can allow themselves to be deceived by "empirical data." Prior to this study, some 19 studies had reported instances of single-group validity, that is, cases in which a given test showed a significant validity for the majority but not the minority group, or vice versa. Most of these "empirical" studies, however, employed ex post facto analyses that capitalized heavily on chance; in addition, the effect of between-group differences in sample sizes was overlooked. The Schmidt, Berner, and Hunter (1973) study demonstrated that the results of these studies—some 410 validity pairs in all—were nicely fit by a statistical model that assumes all instances of single-group validity are due solely to chance. Given the scientific canon of parsimony, the unavoidable conclusion is that these psychologists were busy studying and researching a phenomenon that did not exist. The same thing may be true in the case of the phenomenon of validity differences within similar jobs.
The purpose of the present study is to test this hypothesis, and as an extension of this test, to develop a new method of data analysis for criterion-related validity studies based on the principles of Bayesian statistics. Conceptually, the test of the hypothesis that variation in true validities within job-test combinations is essentially zero is relatively straightforward. One need only locate a fairly large number of obtained validity coefficients, convert them to Fisher's z, and then subtract from the variance of this distribution the variance due to various sources of error. Sources of error variance include small sample sizes, computational and typographical errors, differences between studies in criterion reliability, differences between studies in amount and kind of criterion contamination and deficiency (Brogden & Taylor, 1950), and differences between studies in degree of range restriction. If, after subtracting variance due to these sources, the variance of the distribution of the validity coefficients is essentially zero, the hypothesis is confirmed. Even if the remaining variance is not zero, there will probably still be important implications. After correcting the mean of this distribution for attenuation due to criterion unreliability and for range restriction (based on average values of both), it may become apparent that a very large percentage, say 95%, of all values in the distribution lie above the minimum useful level of validity. In such a case, one could conclude with 95% confidence that true validity was at or above this minimum level in a new situation involving this test type and job without carrying out a validation study of any kind: Only a job analysis would be necessary in order to ensure that the job at hand was indeed a member of the class in question. In cases in which the mean of the corrected distribution is too low and/or the variance too great to allow conclusions of this kind, the corrected distribution will still be useful—as the prior distribution in a Bayesian study of the test's validity. A Bayesian approach to test validation has important advantages over traditional maximum likelihood methods.
A major shortcoming of the maximum likelihood statistical methods used in conventional validation procedures is that only sample-derived information is used in making inferences and decisions about test validity. Relevant information accumulated as a result of past studies done on the same test or test type and the same or similar jobs is not used in validity estimation. Such information may influence initial choice of tests to be included in the study, but final decisions about test validity are based on study outcomes only. Thus even if all previous studies had shown Test A to be highly valid for Job B, it would be concluded to be invalid if the computed validity coefficient failed to reach significance in the specific study at hand. The Bayesian approach, on the other hand, utilizes both sample data and prior knowledge in estimating validity and weights each in proportion to its information value. The sample information is combined with the prior distribution to produce a posterior distribution, the mean (or mode) of which is then taken as the estimate of the parameter of interest, in this case, test validity. Confidence intervals—called credibility intervals by Bayesians—can be placed around this mean. The statistics of this process are usually straightforward and can easily be programmed (Novick & Jackson, 1974). Thus by incorporating the corrected distribution of validity coefficients in Fisher's z form as a Bayesian prior distribution, the model presented here directly relates methods of data analysis used in making inferences about validity in criterion-related validity studies to the concept of validity generalization. The generalizability of validity is seen to be a matter of degree and is quantified in the properties of the prior distribution. Examination of these properties provides a direct answer to the question of whether validity generalization is justified or not without a situation-specific empirical validation study. If such generalization is justified, the (often high) cost of an empirical validation study is avoided. If it is not justified, the theory presented here provides an improved model and method for data analysis and decision making in the required empirical validation study. Thus we have a model that is both elegant and powerful.
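The updating step itself is elementary when both the prior and the sampling distribution of the study result are treated as normal in the Fisher's z metric. The following Python sketch, with purely hypothetical input values, shows the precision-weighted combination described above; it illustrates the logic rather than reproducing the authors' procedure.

```python
# Minimal sketch of the Bayesian updating described above, assuming a normal
# prior and a normal likelihood in the Fisher's z metric.  All numbers are
# hypothetical and are not taken from the article.
from math import atanh, tanh, sqrt

def combine(prior_mean_z, prior_sd_z, study_r, n):
    """Precision-weighted combination of a normal prior with one study of size n."""
    study_z = atanh(study_r)
    w_prior = 1.0 / prior_sd_z**2      # precision of the prior
    w_study = float(n - 3)             # precision of the study (1 / sampling variance of z)
    mean = (w_prior * prior_mean_z + w_study * study_z) / (w_prior + w_study)
    sd = sqrt(1.0 / (w_prior + w_study))
    return mean, sd

mean_z, sd_z = combine(prior_mean_z=0.60, prior_sd_z=0.15, study_r=0.10, n=68)
lower, upper = mean_z - 1.96 * sd_z, mean_z + 1.96 * sd_z
print(round(tanh(mean_z), 2), (round(tanh(lower), 2), round(tanh(upper), 2)))
```

The posterior mean is converted back to the correlation metric, and a 95% credibility interval can be read off directly, which is the kind of output a situational Bayesian validity study would report.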
Table 1
Example of Assumed Distribution of Criterion Reliabilities Across Studies

Reliability    Relative frequency
.90            3
.85            4
.80            6
.75            8
.70            10
.65            12
.60            14
.55            12
.50            10
.45            8
.40            6
.35            4
.30            3

Note. Expected value (reliability) = .60.
Procedure

How does one proceed in correcting observed variance for error variance due to the sources listed above? First, consider variance due to sample size. In this case, one need only know the average sample size across published studies. A recent review of 406 studies (Lent, Aurbach, & Levin, 1971b) places this figure at 68. As shown in Appendix A of this article, variance due to sample size can be estimated as 1/(N − 3), or 1/65. This estimate is conservative since published studies tend to average higher sample sizes than unpublished studies (Guion, 1965, p. 126). In the case of variance due to differences between studies in criterion reliability, one assumes a reasonable distribution of reliabilities across studies and then determines the amount of variance this distribution would contribute to the observed distribution of validities. Table 1 shows such an assumed distribution of criterion reliabilities. The same procedure is followed in the case of differences between studies in range restriction; an example of effects can be seen in Table 2. In the case of both criterion reliability and range restriction, the information necessary to determine actual values in individual studies is not presented in the vast majority of research reports (Jones, 1950). Thus one must rely on reasonable assumed distributions of these effects; for reasons given later, these distributions should usually be conservative. The procedures by which one computes estimates of variance due to criterion reliability are given in Appendix B, and the procedures for computing variance due to range restriction effects are given in Appendix C. After computation, all three of these variances are subtracted from the observed variance, providing the final estimate of true situational variance, that is, variance due to true differences between tests and jobs.
Table 2
Examples of Assumed Distribution of Range Restriction Effects Across Studies

Prior selection ratio    SD of test    Relative frequency
1.00                     10.00         5
.70                      7.01          11
.60                      6.49          16
.50                      6.03          18
.40                      5.59          18
.30                      5.15          16
.20                      4.68          11
.10                      4.11          5

Note. Expected value (SD) = 6.0.
The reader will note that no correction has been made for differences between studies in amount and kind of criterion contamination or deficiency or for computational and typographical errors. Although computational and typographical errors are probably more frequent than usually assumed (Wolins, 1962), it is difficult to estimate their frequency or magnitude and thus difficult to correct for them. In the case of criterion deficiency or contamination, corrections would be even more difficult. In addition, not correcting for these sources of error ensures a conservative procedure; that is, the corrected variance tends to overestimate rather than underestimate true variance. A computer program that makes the corrections described above was written and applied to four validity distributions presented by Ghiselli (1966, p. 29). These distributions contain both published and unpublished validity coefficients. Application of the D'Agostino and Cureton (1972) test, the most powerful such test available, showed that the distributions in Fisher's z form did not depart significantly from normality, thus allowing use of the normal model with these data. Criterion reliability and range restriction inputs were those shown in Tables 1 and 2, respectively.
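The core of such a program is a short chain of variance bookkeeping. The Python sketch below is a hedged illustration of the final subtraction step, not the original program; the .142 component is the combined criterion reliability and range restriction value reported for the mechanical repairmen in Table 3 of the Results section, and the sampling error term follows Appendix A with an average N of 68.

```python
# Hedged sketch of the final variance bookkeeping (not the original program).
from math import sqrt

observed_sd_z = 0.205    # SD of the obtained distribution of Fisher's z validities (Table 3)
artifact_sd = 0.142      # SD expected from reliability + range restriction differences (Table 3)
n_bar = 68               # average sample size across studies

predicted_var = artifact_sd**2 + 1.0 / (n_bar - 3)           # add sampling error variance
residual_var = max(observed_sd_z**2 - predicted_var, 0.0)    # estimated true situational variance

print(round(sqrt(predicted_var), 3))   # about .189, matching the corresponding Table 3 entry
print(round(sqrt(residual_var), 3))    # what is left over for true differences between settings
```

The square root of the residual is the standard deviation available for any true situational differences, which is the quantity of central interest in the Results section.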
Results

Table 3 shows the standard deviations of validity distributions that would be observed if there were in fact no true validity differences and all observed variance was due to various artifactual sources and combinations of sources. These figures can be compared with the observed standard deviations (column 7). In the case of the mechanical repairmen, for example, given only criterion reliability and range restriction differences
between studies and an average sample size of 68, the expected standard deviation is .189 (column 6). This compares with an observed standard deviation of .205. The fact that the latter is larger may stem in part from the existence of real differences between jobs, but since not all artifactual sources of variance have been included, as noted earlier, this conclusion is far from certain. (This point is developed further in the Discussion section of this article.) If, for this same job-test type combination, average sample size had been 30 (an unrealistic assumption), observed variance would have actually been smaller than predicted variance. In the case of every job-test type combination, the standard deviation predicted from the three artifactual sources, given realistic estimates of average sample size (50 and 68), is smaller than the observed standard deviation. The results of central interest are shown in Table 4. For general clerks and mechanical repairmen, the corrected distributions (priors) are such that one can be virtually certain that the tests in question are valid in a new setting without carrying out a criterion-related validity study. For the mechanical repairmen, we are 97.5% confident that true validity is .70 or higher, and for the clerks we can have this same degree of confidence that true validity is at least .40. In the case of the mechanical repairmen, not only is a validity study unnecessary, but it would be extremely difficult for the results of such a study, when used in a Bayesian analysis, to alter the conclusion that the test is valid. The variance of the prior is so low and the mean so high that even in the unlikely event (p ≈ .00) that the study (n = 68) produced an estimate of true validity of -.34, the final estimate of test validity would still be .55. In the case of the clerks, if a validity study were conducted, the prior would strongly influence but not dominate the final results: Study results could alter the a priori conclusion that the test is valid. In the case of the bench workers, the 97.5% credibility interval includes zero. The 95% confidence interval (not shown in Table 4) goes down to a correlation coefficient of .03.
Table 3
Expected Standard Deviations of Fisher's z Validity Distributions Given No True Differences Between Jobs and Tests

Sources of variance: 1 = range restriction; 2 = criterion unreliability.

Job-test type combination                          1      2     1 and 2   1 and 2    1 and 2    1 and 2    Observed
                                                                          (N = 30)   (N = 50)   (N = 68)   SD
Mechanical repairmen, mechanical principles (a)   .081   .080    .142      .239       .203       .189       .205
Bench workers, finger dexterity (b)               .037   .041    .056      .200       .157       .136       .262
General clerks, intelligence (b)                  .068   .088    .111      .222       .184       .167       .263
Machine tenders, spatial relations (b)            .005   .005    .007      .193       .146       .124       .219

(a) Training criteria. (b) Proficiency criteria.
Thus this distribution, although it could be used as a prior in a Bayesian validation study, does not allow conclusions about validity in the absence of a study. The prior in the case of the machine tenders is exactly opposite to that for the mechanical repairmen. Not only can one not conclude that the test is valid from the prior alone, but use of the prior in a Bayesian validity study militates against the conclusion that the test is valid. The prior odds that the test is not valid are so high that an empirical study would have to provide very strong evidence to the contrary to alter this conclusion. (The a priori probability that test validity is .33 or above is only .05.) The prior standard deviations in Table 4 vary from .08 to .23. It is important to note that the raw distributions from which these priors are derived were presented by Ghiselli as examples of what he considered to be an established fact, that is, the principle that validities vary widely even within similar job-test combinations. Thus, in order to make his example more graphic, he may well have chosen test type-job type distributions from among the many he had available that showed particularly high variance. Variances may be smaller, and thus the power of the procedure even greater, when unselected distributions are examined. We are now in the process of applying the procedure to a number of different test types in the clerical occupations, in first-line supervision, and in the occupation of computer programmer.
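The 97.5% credibility values in Table 4 follow mechanically from the corrected priors when the prior is treated as normal in the Fisher's z metric. The Python sketch below reproduces that step from the Table 4 means and standard deviations; it is an illustration of the arithmetic, not the authors' program.

```python
# Sketch of how the 97.5% credibility values in Table 4 follow from the
# corrected priors, assuming a normal prior in the Fisher's z metric.
from math import tanh

priors = {                      # job: (mean z, SD z), from Table 4
    "mechanical repairmen": (1.03, 0.08),
    "bench workers": (0.41, 0.23),
    "general clerks": (0.81, 0.20),
    "machine tenders": (0.05, 0.18),
}
for job, (mean_z, sd_z) in priors.items():
    lower_r = tanh(mean_z - 1.96 * sd_z)   # one-sided 97.5% lower bound, back in the r metric
    print(job, round(lower_r, 2))          # about .70, -.04, .40, and -.30 (within rounding)
```

The same machinery yields the .55 figure quoted above for the mechanical repairmen: combining the prior (mean z = 1.03, SD = .08) with a study of 68 cases yielding a corrected validity estimate of -.34 still leaves a posterior mean near .55.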
Discussion

In the present study, our concern was with the question of whether validity for a given job could be generalized to an entire population of tests of a given type, for example, mechanical principles tests. Therefore, variance due to differences between studies in test reliability was not subtracted from the variance of the prior. That is, this variance was not corrected for because it is an operational property of the population of tests to which we wanted to generalize. Actually, however, the validity generalization model developed in this article can be made more powerful and more accurate when its use is tailored to specific tests within the test type in question. For example, suppose a psychologist wanted to examine the question of validity generalization for mechanical repairmen specifically for the Bennett Mechanical Comprehension Test. The appropriate procedure would be to apply the model as done in this article except that (a) the prior is corrected for variance due to differences between studies in test reliability and (b) the mean of the prior is corrected for mean test unreliability. (When possible, a correction can also be made for variance due to slight differences between tests in factor structure, as discussed below.) Then in using the resulting prior in validity generalization, one attenuates its mean to correspond to the known reliability
Table 4
Results of Study

Job                     Test type                   No. of validity   Prior distribution            97.5% credibility
                                                    coefficients      Mean r(a)  Mean z(a)  SD z    value for validity
Mechanical repairmen    Mechanical principles (b)   114               .78        1.03       .08      .70
Bench workers           Finger dexterity (c)        191               .39         .41       .23     -.04
General clerks          Intelligence (c)             72               .67         .81       .20      .40
Machine tenders         Spatial relations (c)        99               .05         .05       .18     -.30

(a) Corrected for range restriction and attenuation due to criterion unreliability. (b) Training criteria. (c) Proficiency criteria.
of the test he is considering adopting (rather than placing the mean at the value corresponding to average test reliability). Thus one tailors the mean of the prior to the reliability of the test in question. One also eliminates variance in the prior due to differences in test reliability, which becomes error variance when the test in question is held constant. This procedure provides a more powerful model, since it produces a prior with smaller variance. It provides a more accurate model, since results are tailored to the specific level of reliability characterizing one's test. Our model can be used for two different purposes, and the two purposes require two different kinds of priors. Priors for the first purpose, validity generalization, were discussed above. The second purpose is to test the situational specificity hypothesis per se, which, as we have seen, holds that variation in validities is due to true differences between criterion measures in factor structure. Testing of this hypothesis is conceptually identical to research aimed at establishing general principles about trait-criterion relationships to be used in theory construction. This second purpose requires correction for variance in the prior due to test differences in reliability and correction of the mean of the prior for average test unreliability. The appropriate kind of reliability coefficient is the correlation between parallel forms of the test over a reasonable time period. (See the discussion below of appropriate criterion reliability estimates.)
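The tailoring of the prior mean to a specific test's reliability, described at the start of this section, reduces in classical test theory terms to moving the mean between reliability levels by the square roots of the relevant reliabilities. The short Python sketch below is a hypothetical illustration of that bookkeeping, under the assumption that the usual attenuation relation is the intended mechanism; none of the numerical values come from the article.

```python
# Hypothetical illustration of tailoring the prior mean to a specific test's
# reliability (a sketch; the attenuation relation r_observed = r_true * sqrt(r_xx)
# is assumed, and all values are invented for the example).
from math import sqrt

prior_mean_r = 0.55        # prior mean validity at the average test reliability (hypothetical)
mean_test_rel = 0.80       # assumed average reliability of tests of this type (hypothetical)
specific_test_rel = 0.90   # known reliability of the particular test being considered

r_true = prior_mean_r / sqrt(mean_test_rel)             # correct the mean for mean test unreliability
prior_mean_for_test = r_true * sqrt(specific_test_rel)  # re-attenuate to the specific test
print(round(prior_mean_for_test, 3))
```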
In theory, research of this sort also requires corrections for variance due to typographical and other clerical errors and differences between studies in amount and kind of criterion contamination, just as in the case of validity generalization. But as we have seen, there is no known way of making these corrections. Finally, tests of the situational specificity hypothesis require correction for variance due to the slight differences in factor structure that probably exist from one test to another in a given test type (e.g., the Wonderlic vs. the Purdue Adaptability test). This source of variance would be essentially impossible to estimate directly; however, if sufficient data became available to allow establishment of a separate prior for each test, the variance of the means of these different priors would provide the needed estimate. Inability to attain such an estimate may not be catastrophic; recent research suggests that this source of variance will be small (Schmidt, Note 3). Only priors with all of the above corrections made can provide the proper and ultimate test of the situational specificity hypothesis. All of these corrections are necessary, since under the situational specificity hypothesis, only variance due to factor structure differences between criterion measures is true variance. All other variance is error variance. The same is true in the case of research aimed at the establishment of general principles about trait-criterion relationships. The goal of such research is to reveal relations among underlying constructs, independent of measurement problems. The corrected priors in this study, then, were developed with an eye to validity generalization rather than to an optimal test of the situational specificity hypothesis.
But even though only some of the corrections necessary to an optimal test of this hypothesis were made (specifically, only three out of seven), the resulting priors still showed rather small variances. Suppose all of the corrections described above could be made, including corrections for variance due to clerical and typographical errors and differences between studies in amount and kind of criterion contamination. How much variance would then remain within which the hypothesized situational moderators (or moderator variables of any kind) could operate? The evidence indicates that very little would remain. This question is examined more fully in Schmidt (Note 3). Another question that might be raised concerns the mean for the assumed distribution of criterion reliability coefficients. In this study, the mean value was .60 (see Table 1). In light of the fact that many studies containing reliability estimates report coefficients in the .70s and .80s, isn't the .60 value too low (thus leading to overcorrections of the mean of the prior)? There are two important reasons why this question should be answered in the negative. First, the kinds of reliability coefficients typically reported are overestimates of the appropriate reliability coefficients. Because most validity studies use supervisory ratings, we will discuss this point in those terms. The appropriate reliability coefficient for ratings used as criteria in validity studies is the correlation across a reasonable time interval between ratings produced by different raters at Time 1 and Time 2. This follows from the fact that true score on the criterion is defined as independent (a) of minor fluctuations in job performance across time, thus requiring that such fluctuations be assigned to error variance, and (b) of idiosyncrasies of individual raters, thus requiring that variance due to this source be assigned to error variance. Interrater criterion reliability coefficients reported in the literature typically are computed at one point in time and are thus overestimates. When ratings are made at two different times with an intervening interval, the reliability coefficients reported are typically intrarater.
That is, the same, rather than different, raters are used, leading again to overestimates. When only one rater is employed (and sometimes when more than one is employed), reliabilities reported are often internal consistency reliabilities of multiple scale instruments. Although use of multiple scale instruments may reduce error due to scale idiosyncrasies, such internal consistency estimates put both of the sources of error variance discussed above into true variance and are thus overestimates of the appropriate coefficient. The second reason why .60 is probably not an underestimate of true mean criterion reliability stems from probable biases in the reporting of reliabilities. Jones (1950) found that only 22% of studies report criterion reliabilities. These studies are apt to be, on the average, the more methodologically sound ones, and as such they are more likely to have employed reliable criterion measures. Thus the coefficients they report—even if they are the proper kind—are apt to be higher, perhaps substantially higher, than the mean for all studies. A simple example can be given in the case of interrater reliability. The better conducted studies are apt to employ multiple raters, whereas many of the methodologically weaker studies are apt to employ only one rater. Thus by definition, the poorer studies cannot report interrater reliability. Because reliability increases with the number of raters, the low reliabilities are disproportionately not reported. These considerations lead us to the conclusion that if the correct reliability coefficients were reported for all studies, the mean reliability would probably be found to be below .60. However, despite these considerations, we did rerun our analysis using an assumed mean reliability of .65. Changes in the figures shown in Table 4 were very minor, and there was no change in conclusions about generalizability of validities. The next question that must be addressed in connection with this model is the possibility that validity coefficients available for inclusion in any given prior may not be representative of the relevant population of coefficients. This question is easily defined if priors are composed only of coefficients from the published literature.
Wouldn't it be reasonable to assume that researchers are more likely to submit for publication, and editors are more likely to accept, studies reporting statistically significant and/or large validities than those reporting small and/or nonsignificant coefficients? The present study and those we have underway are based on unpublished as well as published coefficients. But this question may not disappear simply as the result of including unpublished studies, because the proportion of coefficients in the prior from unpublished studies may be less than the proportion in the coefficient population from unpublished studies. This outcome would be probable if published studies were more often located by the researchers than unpublished studies. There are other considerations, however, that indicate that this effect may be of little or no consequence. First, a major, and perhaps the major, source of validity coefficients is the Validity Information Exchange, which was published by Personnel Psychology between 1954 and 1965. The editorial policy of the Validity Information Exchange—which was made known to the journal's readers—was to accept submissions without regard to statistical significance of findings. Indeed, the editors made a special point of urging that studies showing nonsignificant results be submitted (Ross, 1961; Taylor, 1953). Of the 1,506 coefficients reported over the lifetime of the Exchange, 856, or 57%, were nonsignificant (Lent, Aurbach, & Levin, 1971a). A second consideration is that most studies contain not one but a number of validity coefficients, making it highly probable that at least some will be statistically significant. For example, if statistical power in a study reporting 10 independent coefficients is only .50, then the probability that at least one coefficient will be significant is .999. The probability that at least two will be significant is .989. For three, the probability is .95, and for four, it is .83. On the other hand, the probability that all will be nonsignificant is (.5)¹⁰, or about .001, which is essentially zero. What this means is that almost all studies will tend to have some significant coefficients, decreasing the probability that the researcher will base his decision to submit, or the editor his decision to accept, on statistical significance.
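These probabilities are straightforward binomial calculations. The Python snippet below is a quick check under the same assumption of 10 independent coefficients, each tested with power .50; it is included only to make the arithmetic explicit.

```python
# Quick binomial check of the probabilities quoted above, assuming 10
# independent coefficients each tested with power .50.
from math import comb

n, power = 10, 0.5

def p_at_least(k):
    """Probability that at least k of the n coefficients reach significance."""
    return sum(comb(n, i) * power**i * (1 - power)**(n - i) for i in range(k, n + 1))

print([round(p_at_least(k), 3) for k in (1, 2, 3, 4)])   # [0.999, 0.989, 0.945, 0.828]
print(round((1 - power)**n, 5))                          # all nonsignificant: about .001
```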
But perhaps the most important consideration is the fact that if such a selective process were operating, the studies it would screen out would tend to be the methodologically poorer ones, that is, those containing information of poorer quality to begin with. Studies reporting all or almost all nonsignificant and/or near-zero validities will tend disproportionately to be those characterized by low statistical power—resulting, in turn, from low criterion reliabilities, use of small samples, and high levels of range restriction (cf. Schmidt, Hunter, & Urry, 1976). (In this connection, it is interesting to note that Boehm, 1977, found a significant negative relationship between methodological quality of studies and probability of reporting findings of single-group validity. Single-group validity has repeatedly been shown to be a chance phenomenon; see Hunter & Schmidt, in press, for a review of this research.) Thus studies reporting nonsignificant results may tend to be rejected not so much on that basis as on the basis of methodological weaknesses. Methodological weaknesses are, after all, not invisible to reviewers. Those studies—probably small in number—that are methodologically sound but nevertheless report only nonsignificant coefficients may have good probabilities of being accepted for publication. If the above hypothesis is true, what would be the effect of somehow being able to retrieve the rejected low quality studies and include their coefficients in the prior? The effect would be a decrease in the accuracy of the prior unless compensating adjustments were made in the assumed means and distributions of criterion reliabilities and range restriction effects and in the assumed average sample size. Specifically, the new distribution of criterion reliabilities would have to have a lower mean and greater variance. The new distribution of range restriction effects would have to show greater mean range restriction and more variance in range restriction effects. And the assumed mean sample size would have to be smaller. The result would be that larger corrections would be made to the mean and variance of the prior to allow for the lower quality of the input data.
Future applications of the validity generalization model developed in this article can be more refined than those presented here. This refinement will be possible whenever sample size is known for each validity coefficient going into the prior. This information was not available for the four distributions of validity coefficients presented by Ghiselli (1966, p. 29) and used in this study. Knowledge of individual sample sizes allows both the uncorrected mean and variance of the prior to be computed with each validity (in Fisher's z form) weighted by its sample size, providing more accurate estimates of these two critical values. The correction for variance due to sample size then becomes the sample-size-weighted average of the values 1/(Ni − 3), which is again a more accurate estimate than that used in the present study (see Appendix A).
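A hedged sketch of that refinement, in Python with invented data, might look as follows; the sample-size-weighted form of the sampling-error correction is an assumption consistent with the description above rather than a formula reproduced from the article.

```python
# Sketch of the refinement described above, assuming each coefficient's sample
# size is known (the study data are invented for illustration).
from math import atanh

studies = [(0.30, 45), (0.18, 120), (0.41, 60), (0.25, 210)]   # (observed r, N)

zs = [atanh(r) for r, n in studies]
ns = [n for r, n in studies]
total_n = sum(ns)

mean_z = sum(n * z for n, z in zip(ns, zs)) / total_n                 # N-weighted mean of z
var_z = sum(n * (z - mean_z)**2 for n, z in zip(ns, zs)) / total_n    # N-weighted variance of z

# Sample-size-weighted sampling-error correction, in place of 1/(mean N - 3).
sampling_var = sum(n / (n - 3) for n in ns) / total_n
residual_var = max(var_z - sampling_var, 0.0)
print(round(mean_z, 3), round(residual_var, 5))
```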
If this model is to be used to maximum effectiveness by personnel psychologists, it is important that reports of validity studies be more complete than they have typically been in the past. Most reports include sample sizes used, but many presently contain no information on degree of range restriction, criterion reliability, or test reliability. In addition, many studies that do contain reliability estimates report inappropriate types of reliability. Another problem is incomplete identification or description of the test used. If the test is a commercially published one, it is sometimes not clear whether it has been shortened for use in the study; if the test has subscales, it is sometimes not clear which subscales were used. If the test was constructed in-house, its description is often inadequate to allow classification of the test. We strongly urge improvement in these reporting practices so that the validity generalization model can be applied effectively in the future. The following thought concerning the validity generalization model may have occurred to the reader:
The value of a prior must certainly increase as the number of validity coefficients it is based on increases, yet the information value of the prior, which is the reciprocal of its variance, may not increase with the addition of coefficients. Actually, the information value of the prior is indexed only by its variance, and its variance is essentially independent of the number of coefficients. The number of coefficients going into the determination of the prior is, however, a (nonlinear) index of the amount of confidence one can have in the estimate of information value provided by the variance. Estimates of variances, like estimates of means, are characterized by standard errors, and confidence intervals can be put around variance estimates, just as they are around estimates of means. This point will be further addressed in a future article. Another point about the prior is perhaps appropriate. When the prior is used in a Bayesian statistical analysis of the results of a criterion-related validity study, the proper procedure is to attenuate the mean of the prior before multiplying it by the likelihood (that is, by the results obtained in the individual study) to produce the posterior distribution. The mean of this posterior distribution is then corrected for attenuation due to criterion unreliability. This procedure ensures that all statistical manipulations are carried out on data uncorrected for attenuation. Results generated over time using this model may have implications for future revisions of the Federal Executive Agency (U.S. Department of Justice, 1976) and the U.S. Equal Employment Opportunity Commission (EEOC, 1970) guidelines on employee selection. Although both make some provision for limited validity generalization, these provisions may have to be modified in their focus and extended significantly in the future as positive results from the use of this model accumulate. In this connection, the following statement from the Federal Executive Agency Guidelines appears relevant: "Nothing in these guidelines is intended to preclude the development and use of other professionally acceptable techniques with respect to validation of selection procedures" (Section 60-3.12, p. 51754).
The general model developed in this article may prove useful in areas of research other than personnel psychology. Its application is likely to be particularly fruitful in any area in which the research literature tends to consist of multiple small-sample correlational studies. An obvious example is the research literature in clinical psychology on the effectiveness of different kinds of psychotherapy. This literature contains multiple small-sample studies, many of which focus, in whole or in part, on correlations between different measured variables. As another example, Fleishman (1975) has advanced the hypothesis that the validities of aptitude tests for predicting success on psychomotor tasks are highly specific to task types. This model might provide a useful vehicle for carefully scrutinizing that hypothesis. The alternate hypothesis, of course, is that the specificity may be more apparent than real. Other examples are left to the imagination of the reader.

The idea of applying Bayesian statistics to test validation is apparently completely new. To our knowledge, it has never before been suggested. It has almost certainly never been empirically studied on a systematic basis. This fact is more understandable when one recalls that social and behavioral scientists have begun to show interest in Bayesian concepts and methods only in the last few years (Novick, Note 4). In a conservative profession—which ours certainly is—any new idea is apt to encounter resistance. However, the methods suggested here have a number of features that will tend to enhance credibility and acceptance.

First, the Bayesian priors to be used are data based; they are not the subjective estimates of one or more individuals, as is usually the case. The Bayesian prior incorporates and represents the results of all available past validity studies on the job-test type combination in question. This point cannot be too strongly stressed. Bayesian statistics have been controversial for only one reason: Critics dislike the idea of the researcher's pulling prior distributions out of his head, based only on hunch and guess. No one questions the mathematics of Bayesian statistics, given the prior. Only the processes by which priors are obtained are criticized. In this procedure, priors are empirically determined, based on data from past studies.

Second, the assumptions made about between-study variance in criterion reliability and range restriction are conservative. That is, less variance is assumed in these variables than in all probability is the case. Thus if there is a bias, it is in the direction of under- rather than overcorrecting the variance of the obtained distribution of validities.

Third, certain sources of error variance in the obtained distribution are not corrected for, further ensuring conservatism. Computational and typographical errors are two such sources, as mentioned earlier. Differences between studies in amount and kind of criterion contamination and deficiency (Brogden & Taylor, 1950) is a third such source.

Fourth, corrections made to the mean of the prior for average range restriction effects are probably conservative. That is, this correction probably tends to underestimate the true mean of the corrected prior. Because the reasons for this are technical, they are explained in Appendix D. In addition, corrections made to the mean of the prior for average criterion unreliability may also be conservative, as indicated earlier.

Fifth, acceptance will be enhanced by the fact that this procedure provides a parsimonious, sophisticated, and technically sound solution for the overarching problem of validity generalization. Properly exploited, it may lead to large dollar savings by eliminating the need for many criterion-related validity studies.

Sixth, the model can be extended to provide an improved method of data analysis and decision making in criterion-related validity studies.

Finally, the model, when used in slightly modified form, provides a tool that may lead to the establishment of general principles about trait-performance relationships in the world of work, thus enabling the field of personnel psychology to move beyond a mere technology to the status of a science.

Reference Notes
1. American Psychological Association, Division of Industrial-Organizational Psychology. Principles for the validation and use of personnel selection procedures. Dayton, Ohio: The Industrial-Organizational Psychologist, 1975.
2. Brogden, H. E. Personal communication, 1970.
3. Schmidt, F. L. Moderator research and the law of small numbers. Paper presented at the Conference on Moderator Research, University of Maryland, March 3-4, 1977.
4. Novick, M. R. Bayesian methods in educational testing: A survey. Paper presented at the Second International Symposium on Educational Testing, Bern, Switzerland, June 29-July 3, 1975.
References

Albright, L. E., Glennon, J. R., & Smith, W. J. The uses of psychological tests in industry. Cleveland, Ohio: Howard Allen, 1963.
Bender, W. R. G., & Loveless, H. E. Validation studies involving successive classes of trainee stenographers. Personnel Psychology, 1958, 11, 491-508.
Boehm, V. R. Differential prediction: A methodological artifact? Journal of Applied Psychology, 1977, 62, 146-154.
Brogden, H. E., & Taylor, E. K. A theory and classification of criterion bias. Educational and Psychological Measurement, 1950, 10, 159-186.
D'Agostino, R. B., & Cureton, E. E. Test of normality against skewed alternatives. Psychological Bulletin, 1972, 78, 262-265.
Fleishman, E. A. Toward a taxonomy of human performance. American Psychologist, 1975, 30, 1127-1149.
Ghiselli, E. E. The validity of occupational aptitude tests. New York: Wiley, 1966.
Guion, R. M. Personnel testing. New York: McGraw-Hill, 1965.
Guion, R. M. Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial-organizational psychology. Chicago: Rand McNally, 1976.
Hunter, J. E., & Schmidt, F. L. Differential and single-group validity of employment tests by race: A critical analysis of three recent studies. Journal of Applied Psychology, in press.
Jones, M. H. The adequacy of employee selection reports. Journal of Applied Psychology, 1950, 34, 219-224.
Lent, R. H., Aurbach, H. A., & Levin, L. S. Research design and validity assessment. Personnel Psychology, 1971, 24, 247-274. (a)
Lent, R. H., Aurbach, H. A., & Levin, L. S. Predictors, criteria, and significant results. Personnel Psychology, 1971, 24, 519-533. (b)
Novick, M. R., & Jackson, P. H. Statistical methods for educational and psychological research. New York: McGraw-Hill, 1974.
Ross, R. A. A new invitation for validity exchange manuscripts. Personnel Psychology, 1961, 14, 1-7.
Schmidt, F. L., Berner, J. G., & Hunter, J. E. Racial differences in validity of employment tests: Reality or illusion? Journal of Applied Psychology, 1973, 58, 5-9.
Schmidt, F. L., Hunter, J. E., & Urry, V. W. Statistical power in criterion-related validity studies. Journal of Applied Psychology, 1976, 61, 473-485.
Taylor, E. The validity information exchange. Personnel Psychology, 1953, 6, 265-270.
Thorndike, R. L. Personnel selection. New York: Wiley, 1949.
U.S. Department of Justice. Employee selection guidelines. Federal Register, 1976, 41 (No. 227, Part III), 51734-51759.
U.S. Equal Employment Opportunity Commission. Guidelines on employee selection procedures. Federal Register, 1970, 35 (149), 12333-12336.
Wolins, L. Responsibility for raw data. American Psychologist, 1962, 17, 657-658.
Appendix A

Computing Variance Due to Sample Size

If a total distribution is obtained by pooling across several other distributions, then by a well-known theorem in analysis of variance, we know that the total variance of that distribution is the variance of the subpopulation means plus the mean of the subpopulation variances. In particular, if sample correlations in Fisher's z form drawn from several populations are computed on varying sample sizes, then the "means" of the resulting distributions of sample correlations are the corresponding population correlations in Fisher's z form, while the "variances" of these distributions are the variances due to sampling error in those correlations. Thus the total variance will be the variance of the means, which is the variance of the population correlations, plus the mean of the within-population variances, which is the mean of the sampling distribution variances. The variance of the population correlations is precisely what is estimated by the variance of the corrected prior, as explained in the text. The "variances" to be averaged are the variances in the correlation coefficients due to sampling error. For Fisher's z, this would be the average (expected value) of 1/(N − 3), or E[1/(N − 3)]. In general, E[1/(N − 3)] ≥ 1/[E(N) − 3], and thus the use of 1/[E(N) − 3] leads to an undercorrection for sample size variance. However, in this case the conservative bias is minimal: 1/[E(N) − 3] = .0154, and E[1/(N − 3)], as computed from the Ns given in Lent et al. (1971a), is .0156. In this study, the smaller of these two figures was used as the estimate of variance due to sample size.
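The inequality in Appendix A can be illustrated numerically. The short Python sketch below uses hypothetical sample sizes (not the Lent et al. data) and simply compares the two estimates; the expected value of the reciprocal is always at least as large as the reciprocal of the expected value.

```python
# Small numerical illustration of the inequality discussed in Appendix A
# (the sample sizes are hypothetical, not the Lent et al. data).
Ns = [20, 35, 50, 68, 90, 150, 300]

e_of_inverse = sum(1.0 / (n - 3) for n in Ns) / len(Ns)    # E[1/(N - 3)]
inverse_of_e = 1.0 / (sum(Ns) / len(Ns) - 3)               # 1/[E(N) - 3]
print(round(e_of_inverse, 4), round(inverse_of_e, 4))      # the first is always >= the second
```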
Appendix B

Computing Variance Due to Differences Between Studies in Criterion Reliability

1. Compute the mean of the validity distribution in Fisher's z form and convert it to r.
2. Correct this raw r for criterion unreliability and range restriction using average values across studies for both. (In the pilot study, average assumed criterion reliability was .60 and average range restriction was to a standard deviation of 6.0 from an unrestricted standard deviation of 10.0; see Tables 1 and 2 in the text.) This provides an estimate of the mean true validity, r_T.
3. For each value of assumed criterion reliability r_yy_i, compute r_T √r_yy_i and convert this attenuated r to z_i. Compute Σ n_i z_i and Σ n_i z_i², where n_i = the relative frequencies of the criterion reliabilities.
4. Variance due to reliability differences in the z distribution of validities is then

Σ n_i z_i²/Σ n_i − (Σ n_i z_i/Σ n_i)².
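The following Python fragment is an illustrative implementation of these steps; the mean true validity r_T is hypothetical, and the reliability distribution is the one assumed in Table 1.

```python
# Illustrative implementation of the Appendix B steps (a sketch, with a
# hypothetical r_T; the reliability distribution comes from Table 1).
from math import atanh, sqrt

reliabilities = [.90, .85, .80, .75, .70, .65, .60, .55, .50, .45, .40, .35, .30]
frequencies   = [3, 4, 6, 8, 10, 12, 14, 12, 10, 8, 6, 4, 3]   # Table 1
r_T = 0.50                                                      # step 2 result (hypothetical)

# Step 3: attenuate r_T for each assumed reliability and convert to Fisher's z.
zs = [atanh(r_T * sqrt(ryy)) for ryy in reliabilities]

# Step 4: weighted variance of the attenuated z values.
total = sum(frequencies)
mean_z = sum(n * z for n, z in zip(frequencies, zs)) / total
var_reliability = sum(n * z**2 for n, z in zip(frequencies, zs)) / total - mean_z**2
print(round(var_reliability, 5))
```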
Appendix C

Computing Variance Due to Range Restriction Differences Between Studies

1. Compute the mean of the validity distribution in Fisher's z form and convert it to r. Correct this raw r for mean range restriction but not for attenuation.
2. For each value of the restricted standard deviation, use the following formula to compute the expected restricted r:

r_i = u_i R / √(u_i²R² − R² + 1),

where r_i = the restricted validity, R = the unrestricted validity, u_i = sd_i/SD, SD = the standard deviation of the test in the unrestricted group, and sd_i = the standard deviation of the test in the restricted group. This formula is obtained by solving Thorndike's (1949, p. 173) Case II formula for r_i. (Thorndike's Case II is the model throughout these analyses; use of Case III would generally produce very similar results.)
3. Convert each r_i to z_i and compute Σ n_i z_i and Σ n_i z_i², where n_i = the relative frequencies of the range restriction effects.
4. Variance due to range restriction differences between studies is then

Σ n_i z_i²/Σ n_i − (Σ n_i z_i/Σ n_i)².
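As with Appendix B, an illustrative Python implementation of these steps follows; the value of R is hypothetical, and the range restriction distribution is the one assumed in Table 2.

```python
# Illustrative implementation of the Appendix C steps (a sketch, with a
# hypothetical R; the restriction distribution comes from Table 2).
from math import atanh, sqrt

restricted_sds = [10.00, 7.01, 6.49, 6.03, 5.59, 5.15, 4.68, 4.11]   # Table 2
frequencies    = [5, 11, 16, 18, 18, 16, 11, 5]
SD_unrestricted = 10.0
R = 0.45   # validity corrected for mean restriction but not attenuation (hypothetical)

def restricted_r(R, u):
    """Step 2: Thorndike Case II solved for the restricted correlation."""
    return u * R / sqrt(u**2 * R**2 - R**2 + 1.0)

# Step 3: expected restricted validity for each degree of restriction, in Fisher's z.
zs = [atanh(restricted_r(R, sd / SD_unrestricted)) for sd in restricted_sds]

# Step 4: weighted variance of those z values.
total = sum(frequencies)
mean_z = sum(n * z for n, z in zip(frequencies, zs)) / total
var_restriction = sum(n * z**2 for n, z in zip(frequencies, zs)) / total - mean_z**2
print(round(var_restriction, 5))
```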
Appendix D

Effect of Range Restriction on the Mean of the Prior

Our corrections for range restriction assume truncation is on the predictor only (Thorndike's Case II). Some validation studies are probably carried out in situations in which there has been some degree of truncation on the criterion. If there is truncation on the criterion but none on the predictor, Thorndike's Case I becomes the relevant model. For example, a life insurance agency may accept all who apply regardless of test score but systematically weed out the bottom 50% in performance the first year. The requirement of the Thorndike Case I model that there be no truncation at all on the predictor is probably rarely met in validity studies, and thus this model is probably never fully appropriate. But to the extent that a prior contains coefficients computed after truncation on the criterion, that is, to the extent that Case I is even partially appropriate for some of the coefficients, our use of the Case II model leads to underestimates of the effect of range restriction on the mean of the prior. An examination of the properties of the Case I formula shows that this effect would be substantial. (However, this effect may be partly offset by the fact that the Case II formula provides slight overestimates of corrected coefficients when truncation on the test is not perfect, for example, when some applicants below the cutoff score are selected or when selection is on the basis of the sum of scores on two or more tests rather than a single test.) Our use of the Case II model also leads to an underestimate of the variance in the prior due to range restriction, since there is no allowance for variance due to differences between studies in extent of truncation on the criterion. But to the extent that, within a given job, degree of truncation on the criterion would tend to be similar from organization to organization (and therefore from study to study), this source of variance should not be large.
Received November 22, 1976