Journal of Applied Psychology
1981, Vol. 66, No. 3, 261-273
Copyright 1981 by the American Psychological Association, Inc.
0021-9010/81/6603-0261$00.75
Validity Generalization Results for Two Job Groups in the Petroleum Industry

Frank L. Schmidt
George Washington University

John E. Hunter
Michigan State University

James R. Caplan
University of Virginia, Northern Virginia Center

This study estimated the transsituational generalizability (or "transportability") of the validities of four types of cognitive tests and a weighted biographical information blank for performance in two petroleum industry job groups. For the cognitive measures, generalizability was strongly supported for mechanical comprehension and chemical comprehension tests for both jobs. In the case of the chemical comprehension tests, virtually all variance of observed validity coefficients was accounted for by artifacts, and thus the hypothesis of situational specificity was rejected. Support for generalizability was substantial for general mental ability and arithmetic reasoning tests. In this study, unlike in previous studies, distributions of criterion reliabilities and restricted standard deviations were taken from study data. It was found, however, that corrections for variance due to sampling error accounted for an average of 90% of all variance due to artifacts, indicating the relative unimportance of differences between studies in criterion reliability and in range restriction in accounting for variation in observed validities. Generalizable multivariate validities were estimated for various test batteries using beta and unit weights. Finally, true score beta weights were used to estimate the causal role of the four cognitive abilities in job performance.
This study is one of a series of studies applying the Schmidt-Hunter validity generalization model to validity data for different occupations (Pearlman, Schmidt, & Hunter, 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, & Shane, 1979). The purpose of this research program has been to test empirically the traditional belief in the situational specificity of employment test validities. A full description of the conceptual basis of this validity generalization procedure can be found in the earlier studies, particularly those of Pearlman et al. (1980) and of Schmidt, Gast-Rosenberg, and Hunter (1980). The purpose of this study is to apply the validity generalization model to available validity data for several occupations found in the petroleum industry. The study was conducted under the auspices of the American Petroleum Institute (API). Details beyond those given here can be found in Schmidt, Hunter, and Caplan (Note 1).

Author Note. The research reported herein was funded by the American Petroleum Institute (API) and was assisted by the API Subcommittee on Personnel Selection. The members of this steering committee were T. J. Carron (Ethyl Corporation), T. E. Standing (Sohio), C. P. Sparks (Exxon), J. W. Herring (Exxon), T. B. Porter (Standard Oil Company [Indiana]), and W. B. Bryant (Shell). The authors would like to thank Kenneth Pearlman for his helpful comments on an earlier version of this article. Requests for reprints should be sent to Frank L. Schmidt, Department of Psychology, George Washington University, Washington, D.C. 20052.

Method

The data used in this study were taken from validity studies conducted at participating petroleum companies. All companies on API's membership list were asked to participate in the validity generalization study. Thirteen petrochemical and chemical companies agreed to participate. Contributors were assured of company anonymity in the report. Research reports and supplementary data were sent to Rice University, where the information was extracted and coded.¹

¹ This work was done by Elizabeth S. Sechler working under the supervision of William C. Howell.

Members of the steering committee (see
Author Note) assisted in making decisions regarding coding format and in clarifying some of the information contained in the reports.

As in previous studies, tests were coded on the basis of the constructs measured (e.g., mechanical ability), and jobs were coded into job groups or families. In some reports, job descriptions from the Dictionary of Occupational Titles (U.S. Department of Labor, 1977) were supplied. In others, job duties as determined from a job analysis were given. In some cases, members of the API steering committee were asked to provide assistance in classifying jobs. The three most frequently classified job groups were operations, maintenance, and laboratory.

Estimates of criterion reliability were frequently available and were recorded along with a code indicating the type of reliability (i.e., internal consistency, rate-rerate, or interrater agreement). An index of range restriction—the ratio of the incumbent to the applicant pool test standard deviation—was recorded when the necessary data were available.

Criterion measures were classified as measures either of job performance (success) in training or of post-training job performance. Only validities based on measures of overall job or training performance were used in this study. Partial measures of performance (e.g., ratings on specific dimensions of job performance) were excluded. The results obtained thus apply to the job as a whole rather than to only a portion of the job. Where a number of dimensions of job performance were measured, the composite (sum or average) of these measures was taken as the measure of overall job performance. If such a composite was not available, a rating or assessment of overall performance was used. If neither was available, the data were not used. In only one study, however, was neither a composite nor an overall assessment available.

Test-job combinations having eight or more independent validity coefficients were considered suitable for analysis using the Schmidt-Hunter validity generalization procedure. (From the Bayesian point of view, which weights each validity distribution according to its information value, there is no theoretical basis for imposing a lower limit on the number of coefficients required for analysis using the present model of validity generalization. Nevertheless, because there is a decrease in the accuracy with which the total variance of a distribution of empirical validity coefficients estimates the total variance in the relevant population of validities as the number of coefficients in the distribution decreases, we felt it desirable to establish such a minimum, even though it was to some extent arbitrary.) None of the test-job combinations based on training success criteria had eight or more coefficients. (Additional data are now being collected.) Nine of the job-test combinations based on job proficiency criteria had eight or more validities. These combinations were (a) operator job group: mechanical comprehension, chemical comprehension, general intelligence, and arithmetic reasoning; and (b) maintenance job group: mechanical comprehension, chemical comprehension, general intelligence, arithmetic reasoning, and RBH background survey. The job codes in the Dictionary of Occupational Titles for the operator and maintenance job groups are shown in the Appendix.

A small number of coefficients were dropped from certain of these test-job combinations. In the case of
the mechanical comprehension tests, all validity coefficients were based on different forms of the Bennett Test of Mechanical Comprehension for the operator job group. For the maintenance job group, this was true of all but one coefficient. This one coefficient was dropped so that all data and results for both jobs would be based on the Bennett test. In the case of the general intelligence measures, two of the coefficients for the operator job group and one coefficient for the maintenance job group were based on the beta examination, a nonverbal measure of intelligence. Since intelligence as measured by nonverbal items is potentially a different construct from intelligence as measured by verbal and numerical items, these three coefficients were dropped.
Data Analysis

Criterion reliabilities. Because criterion reliabilities are virtually never corrected for attenuation due to indirect range restriction resulting from selection on a valid predictor (Schmidt, Hunter, & Urry, 1976), it could be safely assumed that the reported criterion reliability estimates had not been corrected. In the data used in this study, there was on the average only very limited range restriction on the predictors (see the discussion below and Table 3), and as a result, correction for range restriction would increase the reliability estimates by only a few points at most. The validity generalization model calls for the use of applicant pool criterion reliabilities, and all other things equal, the reported criterion reliability estimates would therefore be slight underestimates. All other things, however, were not equal. Specifically, all reported reliability estimates were found to be indices of interrater agreement, with the time interval between ratings less than 1 week in all cases. The kind of reliability that is appropriate for job performance ratings in estimating operational test validities is interrater agreement with a substantial intervening time interval (Schmidt & Hunter, 1977). The reason for the time interval requirement is that the purpose of selection tests is not to predict employee performance at one point in time but rather to predict (mean) performance over time. Thus the estimate of reliability used should be one that assigns variance in performance due to temporal fluctuations to error variance. When raters provide their ratings at one point in time, this variance tends to be assigned to true variance, thus yielding an inflated reliability estimate. Had the appropriate time interval occurred between ratings, the computed reliabilities in this study would probably have averaged at least .15 below the observed values. Making generous allowance for the slight negative bias due to ignoring the range restriction, we reduced all reliability estimates by .10 rather than by .15. That is, it was assumed that had the appropriate time interval occurred between ratings and had the reliabilities been computed on a fully unrestricted group, the computed reliabilities would have averaged .10 below the observed values. This assumption is conservative; in all likelihood, the coefficients could have been reduced by more than .10. The adjusted reliability estimates were assigned equal frequencies except where it was necessary to make the frequencies sum to 100, at which point an additional frequency was added. This additional frequency was placed as close as possible to the distribution mean.
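The frequency-assignment rule just described is mechanical enough to state in a few lines. The sketch below is our illustration, not code from the study, and it covers only the integer case; cells whose number of values did not divide 100 evenly (e.g., eight values) instead received fractional frequencies such as 12.5. The example data are the 11 maintenance-mechanical comprehension restricted standard deviations from Table 3, to which the same rule was applied.

```python
def relative_frequencies(values):
    """Assign equal relative frequencies summing to 100; any remainder
    goes to the value closest to the distribution mean."""
    k = len(values)
    base = 100 // k
    freqs = [base] * k
    mean = sum(values) / k
    # Place the additional frequency as close as possible to the mean.
    nearest = min(range(k), key=lambda i: abs(values[i] - mean))
    freqs[nearest] += 100 - base * k
    return freqs

# Maintenance-mechanical comprehension restricted SDs (Table 3): 11 values,
# so ten receive 9 and the value nearest the mean of 7.6 (namely 7.3) gets 10.
sds = [9.2, 8.8, 8.7, 8.6, 8.2, 7.3, 7.1, 7.1, 6.9, 6.1, 5.6]
print(relative_frequencies(sds))  # [9, 9, 9, 9, 9, 10, 9, 9, 9, 9, 9]
```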
A similar procedure was followed for range restriction effects, described below. Equal frequencies directly mirror the relative frequencies found in the data. (At first glance, this may appear to create a rectangular distribution of reliabilities; it does not, however, because the reliability estimates are not equidistant from each other.) Because the validity coefficients in the different job group-test cells sometimes came from the same studies, different cells occasionally had the same reliability distributions. Criterion reliability distributions are shown in Table 1.

Test reliabilities. Test reliabilities were not reported in the validity studies from which data were taken. In estimating test reliabilities, it was necessary to consider the nature of the tests in each job group-test combination. For both the operator and maintenance job groups, the situation was as follows: Mechanical comprehension: All tests were different forms of the Bennett Test of Mechanical Comprehension. Chemical comprehension: All tests were different forms of the RBH Chemical Comprehension Test. General intelligence: Validity coefficients were based on (different forms of) four different instruments. Arithmetic reasoning: All tests were different forms of the RBH Arithmetic Reasoning Test. Thus, in the case of both job groups, little variance in test reliabilities was expected for three of the four cells. Distribution 1 in Table 2 was created for these cells.
Table 1
Criterion Reliability Distributions for Job-Test Combinations as Derived From American Petroleum Institute Data

Distribution 1a   Distribution 2b   Distribution 3c   Distribution 4d
  R     RF          R     RF          R     RF          R     RF
 .82     9         .83   12.5        .82   12.5        .82    10
 .76     9         .82   12.5        .76   12.5        .76    10
 .75     9         .70   12.5        .75   12.5        .75    10
 .73     9         .69   12.5        .73   12.5        .73    10
 .73     9         .63   12.5        .73   12.5        .73    10
 .70    10         .62   12.5        .70   12.5        .70    10
 .65     9         .61   12.5        .65   12.5        .65    10
 .63     9                                             .65    10
 .56     9                                             .65    10
 .51     9                                             .51    10

Note. R = reliability; RF = relative frequency. a Distribution 1 used for operator-mechanical comprehension. b Distribution 2 used for maintenance-mechanical comprehension, chemical comprehension, general intelligence, arithmetic reasoning, and RBH background survey. c Distribution 3 used for operator-chemical comprehension and arithmetic reasoning. d Distribution 4 used for operator-general intelligence.
Table 2
Distributions of Assumed Test Reliabilities

Distribution 1a          Distribution 2b
Reliability   RF         Reliability   RF
   .90         5            .90         8
   .85        10            .85        20
   .80        20            .80        30
   .75        30            .75        20
   .70        20            .70        10
   .60        15            .60         5
                            .50         5
                            .40         2

Note. Distribution 1 was used for all job-test combinations except maintenance-general intelligence and operator-general intelligence, for which Distribution 2 was used. a Mean test reliability = .75. b Mean test reliability = .76.

Distribution 1 assumes minimal variance in test reliabilities and a mean test-retest reliability over a reasonable time period (Schmidt & Hunter, 1977) of .75. For the fourth cell for each job (the general intelligence cell), it was necessary to assume more variance in test reliability, and thus Distribution 2 in Table 2 was used. Since there was no reason to assume lower mean reliability for the measures of general intelligence, the mean of Distribution 2 is approximately the same as that of Distribution 1. Finally, in the case of the maintenance-RBH background survey combination, all test reliabilities were assumed equal because the same form of the same instrument was used in computing each validity coefficient. Mean instrument reliability was estimated at .75.

Range restriction effects. As required for use in the validity generalization program, the decimal points in all range restriction ratios were set such that the unrestricted (applicant pool) standard deviations were all equal to 10.00. All values of the restricted standard deviations were then expressed relative to unrestricted standard deviations of 10. For each cell, available restricted standard deviations were assigned equal frequencies, as was done with criterion reliabilities. Range restriction distributions are shown in Table 3. All values of the restricted standard deviations are arranged in order from the highest to the lowest. In all cells except one, one or more of the restricted standard deviations are as large as or larger than the unrestricted value of 10. In one distribution, that for maintenance-arithmetic reasoning, 4 of the 10 values are greater than 10. In another cell, that for operators-arithmetic reasoning, 3 of the 9 values exceed 10.
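The rescaling described above amounts to multiplying each restricted standard deviation by 10 divided by its unrestricted counterpart. A minimal sketch (ours; the variable names and example figures are hypothetical, not taken from the study data):

```python
def rescaled_restricted_sd(sd_incumbents, sd_applicants):
    """Express the incumbent (restricted) SD on a scale where the
    applicant-pool (unrestricted) SD equals 10.00."""
    return 10.0 * sd_incumbents / sd_applicants

# Hypothetical study: an incumbent SD of 6.8 against an applicant-pool SD
# of 8.0 would be tabled as 8.5.
print(rescaled_restricted_sd(6.8, 8.0))  # 8.5
```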
Table 3
Range Restriction Effects Distributions for Job-Test Combinations as Derived From American Petroleum Institute Data

Operator

Mechanical        Chemical          General           Arithmetic
comprehension     comprehension     intelligence      reasoning
  SD    RF          SD    RF          SD    RF          SD    RF
 13.6    8         11.6   10         13.3   10         11.6   10
  9.9    8          9.5   10         10.4   10         11.3   10
  9.5    8          9.3   10          9.4   10         10.9   10
  8.5    8          9.0   10          8.1   10          9.9   10
  7.8    9          8.7   10          8.1   10          9.7   10
  7.6    9          8.2   10          8.0   10          9.5   10
  6.9    9          7.4   10          7.7   10          9.3   10
  6.6    9          7.1   10          7.5   10          8.3   10
  6.5    8          6.8   10          6.6   10          7.8   10
  6.4    8          6.2   10          4.6   10          5.4   10
  6.2    8
  5.6    8
Mean SD = 7.9     Mean SD = 8.4     Mean SD = 8.4     Mean SD = 9.4

Maintenance

Mechanical        Chemical          General           Arithmetic        RBH background
comprehension     comprehension     intelligence      reasoning         survey
  SD    RF          SD    RF          SD    RF          SD    RF          SD      RF
  9.2    9         10.1   10         10.3   10         12.5   10         10.00    5
  8.8    9          9.3   10          9.9   10         10.6   10          7.01   11
  8.7    9          8.8   10          9.8   10         10.2   10          6.49   16
  8.6    9          8.7   10          9.5   10         10.1   10          6.03   18
  8.2    9          8.2   10          9.1   10          9.8   10          5.59   18
  7.3   10          8.2   10          7.5   10          9.7   10          5.15   16
  7.1    9          8.2   10          7.4   10          9.4   10          4.68   11
  7.1    9          7.7   10          7.3   10          9.3   10          4.11    5
  6.9    9          7.4   10          7.3   10          8.6   10
  6.1    9          6.9   10          5.5   10          8.2   10
  5.6    9
Mean SD = 7.6     Mean SD = 8.4     Mean SD = 8.4     Mean SD = 9.8     Mean SD = 6.00

Note. RF = relative frequency. In all cases, the applicant pool (unrestricted) SD = 10.00. American Petroleum Institute data contained no information on range restriction in the case of the maintenance-RBH (Richardson, Bellows, & Henry) background survey; thus the distribution of range restriction effects was estimated.
Taken at face value, these data seem to indicate that in some cases, and especially in the case of arithmetic reasoning, incumbents are more variable than are applicants. But because study sample sizes were frequently small, it is likely that some or all of these large values for the restricted standard deviations are due to simple sampling error. We accept the data at face value, however, for the purposes of this study. If these values are artifactual, the effect is to underestimate both mean validities and degree of validity generalizability.

Using the data described above, the hypothesis of situational specificity and the question of validity generalization were investigated using a recently developed validity generalization computational procedure (Schmidt, Gast-Rosenberg, & Hunter, 1980). This procedure differs from the procedure used in Pearlman et al. (1980) in that the variance due to criterion reliability differences, test reliability differences, and range restriction differences is computed simultaneously. In theory, this procedure should be slightly more accurate because it takes into account the small interaction that occurs between the effects of range restriction and criterion (and test) unreliability. Because of this, we have labeled it the interactive procedure. Both of our previous procedures (Pearlman et al., 1980; Schmidt, Hunter, Pearlman, & Shane, 1979) were "noninteractive," since all sources of artifactual variance were computed independently and then summed. The effect of criterion unreliability (or test unreliability) is less where range restriction is greater. For example, if true validity were .50 and criterion reliability were .80, then if there were no range restriction, the effect of criterion unreliability would be to reduce the expected value of r to √.80 × .50 = .447, a reduction of .053. But if range restriction reduced the expected value of r to, say, .35, then the effect of the same criterion unreliability would be √.80 × .35 = .313, a reduction of only .037. Our previous procedures ignore this interaction and act as if the main effects of criterion unreliability were the same at all levels of range restriction. In real data, this interaction is trivial in magnitude and does not affect conclusions drawn from the data (Schmidt, Gast-Rosenberg, & Hunter, 1980). Nevertheless, the interactive procedure, because it takes this interaction into account, should provide slightly more accurate estimates of the percentage of variance in observed validity coefficients that is due to the artifacts of test reliability, criterion reliability, and range restriction differences between studies. It should thus provide a slightly better test of the situational specificity hypothesis. It does not, however, provide more accurate estimates of SDρ, the standard deviation of true validities. Callender and Osburn (1980), using simulation methods, showed that while both procedures slightly overestimate SDρ, the interactive procedure overestimates slightly more than does the noninteractive procedure. That is, the interactive procedure is slightly more conservative. This difference, however, is trivial in size and will virtually never lead to false conclusions that validity is not generalizable. The interactive validity generalization procedure computes variance due to sample size in the same way the noninteractive procedure used in Pearlman et al. (1980) does. A full description of the interactive validity generalization procedure is given in the text and the Appendix of Schmidt, Gast-Rosenberg, and Hunter (1980).
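The interaction can be made concrete with the worked figures from the text. The sketch below is our illustration of that arithmetic: the absolute decrement produced by a fixed criterion reliability shrinks as range restriction pushes the expected correlation down.

```python
import math

def attenuate(r, criterion_reliability):
    """Attenuate a validity coefficient for criterion unreliability."""
    return math.sqrt(criterion_reliability) * r

# No range restriction: true validity .50, criterion reliability .80.
print(round(attenuate(.50, .80), 3))  # 0.447, a reduction of .053
# Range restriction has already reduced the expected r to .35.
print(round(attenuate(.35, .80), 3))  # 0.313, a reduction of only .037
```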
As in our previous validity generalization studies, no corrections were made for differences between studies in the amount and the kind of criterion contamination or deficiency (Brogden & Taylor, 1950), for computational and typographical errors, and for slight differences between tests in factor structure. Although computational and typographical errors are probably more frequent than psychologists usually assume them to be (Wolins, 1962), it is difficult to estimate their frequency or magnitude and thus difficult to correct for them. In the case of criterion deficiency or contamination, corrections would be even more difficult. Not correcting for these sources of error makes the results even more conservative; that is, the values of the standard deviation of true validities tend to be overestimates, and thus the degree of generalizability (transportability) of validities is underestimated. Earlier we stated that the situational specificity hypothesis is rejected when the remaining variance is zero or near zero. Because of the conservative biases discussed here, such a decision rule is scientifically inappropriate in practice. An appropriate decision rule must take into account the fact that we correct for only four of seven artifactual sources of variance (cf. Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980). We therefore have adopted the more realistic (but still conservative) decision rule that the situational specificity hypothesis should be rejected whenever 75% or more of the variance of the validity coefficients is accounted for by the four artifacts for which corrections are made.
Results and Discussion
Situational Specificity Hypothesis

The results relevant to the situational specificity hypothesis are presented in Table 4. For each job group-test combination, Table 4 shows the standard deviation of the reported validity coefficients, the standard deviation predicted on the basis of the four statistical artifacts for which corrections were made, and the percentage of original observed variance accounted for by the artifacts. The predicted standard deviation is the square root of the sum of the variance due to sample size and the variance due to the combined effects of criterion reliability differences, test reliability differences, and range restriction differences. The predicted standard deviation is slightly larger than the observed standard deviation in one case. This is exactly the type of result to be expected in certain cases in which the situational specificity hypothesis is false: If there are no differences in true validities between jobs in a given job family, and if the unassessed sources of artifactual variance are negligible, the predicted standard deviation should fall slightly above the observed standard deviation about half the time and slightly below the observed standard deviation about half the time.
Table 4
Observed and Predicted Standard Deviations and Percentage of Variance Accounted For

                              Mean observed    Observed    Predicted    % variance
Test type and job group       coefficient      SDa         SDa          accounted for
Mechanical comprehension
  Operator                        .22            .131        .107            67
  Maintenance                     .21            .165        .125            57
Chemical comprehension
  Operator                        .22            .113        .108            91
  Maintenance                     .17            .123        .126           100
General intelligence
  Operator                        .18            .172        .107            39
  Maintenance                     .21            .177        .122            48
Arithmetic reasoning
  Operator                        .20            .187        .105            32
  Maintenance                     .12            .180        .127            50
RBH background survey
  Maintenance                    -.03            .143        .124            75

Note. RBH = Richardson, Bellows, & Henry. a In r form.
In the case of the chemical comprehension tests, the situational specificity hypothesis is rejected for both job groups. In the case of the RBH background survey-maintenance cell, 75% of the observed variance is accounted for, allowing rejection of the situational specificity hypothesis. As we shall see below, however, the best estimate from these data of the validity of this instrument for the maintenance job group is near zero. Under our 75% decision rule, results for other test-job combinations do not allow rejection of the situational specificity hypothesis. Thus the situational specificity hypothesis can be rejected for three of the nine test-job combinations. As we saw earlier, however, rejection of the situational specificity hypothesis, while desirable, is not a prerequisite for validity generalization.

In previous validity generalization studies, virtually no empirical data were available on criterion reliabilities, test reliabilities, and degrees of range restriction associated with the observed validity coefficients used as input data. This fact necessitated the use of reasonable assumed distributions of these variables. In this study, data were available on criterion reliabilities and on amounts of
range restriction. But even here, the data were only partial, necessitating the generalization of available data to the coefficients lacking data. Reviewers occasionally have voiced concern about the use of assumed distributions, speculating that if it should turn out that such distributions are not conservative, validity generalization results might be overstated. In response to this concern, we have examined the data in our studies and have found that criterion reliability differences, test reliability differences, and range restriction differences taken together typically account for only a small portion of the variance accounted for by artifacts. Most of the variance accounted for by artifacts is accounted for by sampling error. Table 5 shows the relevant percentages for this study. Sampling error accounts for at least three quarters of the variance due to artifacts in every case. The average across test-job combinations is 90%. The percentage of accounted-for variance due to artifacts other than sampling error is so small that conclusions would not differ if the effects of these artifacts were not taken out. This fact has been demonstrated empirically in two other studies (Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980). In these studies, it was shown that when data were analyzed under the extreme and unrealistic assumption that there were no differences between studies in criterion and test reliability and range restriction, validity generalization conclusions were unchanged. Thus, as a practical matter, it is of little consequence within broad ranges what distributions of these artifacts are assumed.
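The dominance of sampling error is easy to check from the summary statistics already reported. The sketch below is ours, not part of the published analysis: it applies the standard formula for the sampling variance of a correlation, (1 − r̄²)²/(N̄ − 1), to the operator-mechanical comprehension cell, using the mean validity and observed standard deviation from Table 4 and the sample sizes from Table 6.

```python
# Sampling-error variance of observed validities in one cell:
# var_e = (1 - mean_r**2)**2 / (mean_n - 1), mean_n = average study n.
mean_r = .22           # mean observed validity (Table 4)
total_n, k = 1800, 18  # total sample size and number of studies (Table 6)
mean_n = total_n / k   # 100
var_e = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
observed_var = .131 ** 2  # observed SD of validities (Table 4)
print(round(var_e, 5))                    # ~0.00915
print(round(100 * var_e / observed_var))  # ~53% of observed variance
```

Sampling error alone thus reproduces roughly 53% of the observed variance in that cell; the full interactive procedure, adding the other three artifacts, arrives at the 67% shown in Table 4, of which sampling error is the 76.7% share shown in Table 5.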
Table 5
Percentage of the Variance Accounted For in Observed Validity Coefficients Due to Sample Size and Other Artifacts

Test type and job group       Sample sizea    Otherb    Total
Mechanical comprehension
  Operator                        76.7          23.3      100
  Maintenance                     92.4           7.6      100
Chemical comprehension
  Operator                        87.5          12.5      100
  Maintenance                     96.7           3.3      100
General intelligence
  Operator                        82.5          17.5      100
  Maintenance                     89.4          10.6      100
Arithmetic reasoning
  Operator                        86.8          13.2      100
  Maintenance                     98.3           1.7      100
RBH (Richardson, Bellows, & Henry) background survey
  Maintenance                     99.7            .3      100

a Mean value = 90.0. b Mean value = 10.0.
Validity Generalization

Results bearing on the question of the generalizability of validities are shown in Table 6. For each job group-test type combination, Table 6 shows the total sample size on which the analysis was based and the number of validity coefficients. Also shown are the mean and standard deviation of the prior Bayesian validity distribution, along with the 90% credibility values.
Table 6
Validity Generalization Results

                                        Bayesian prior distribution
                              No. validity                       90% credibility
Test type and job group    n  coefficients    ρ̄a       SDρ       value for ρ
Mechanical comprehension
  Operator             1,800       18         .33       .12           .19
  Maintenance            706       12         .33       .17           .11
Chemical comprehension
  Operator             1,138       13         .30       .05           .24
  Maintenance            605       10         .25       .00           .25
General intelligence
  Operator             1,486       16         .26       .19           .01
  Maintenance            821       13         .30       .18           .06
Arithmetic reasoning
  Operator             1,067       12         .26       .20           .01
  Maintenance            628       11         .15       .16          -.05
RBH (Richardson, Bellows, & Henry) background survey
  Maintenance            505        8        -.05       .15          -.25

a Estimated true validities. These figures are corrected for range restriction and criterion unreliability but not for test unreliability. These are fully corrected mean rs attenuated by the square root of mean test reliability. Thus a mean test reliability of .75 or .76 (see Table 2) is assumed. Means and credibility values thus apply to tests having these reliabilities. Tests with higher reliabilities would have larger true validities and larger credibility values, and vice versa.
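The credibility values in Table 6 follow directly from the mean and standard deviation of the prior. A minimal sketch (ours), assuming an approximately normal prior distribution; because the published values were computed from unrounded means and standard deviations, a few entries differ from this arithmetic in the second decimal.

```python
def credibility_value(mean_rho, sd_rho, z=1.28):
    """Value exceeded by 90% of the true validities, assuming normality."""
    return mean_rho - z * sd_rho

# Chemical comprehension, operator (Table 6): .30 - 1.28 * .05
print(round(credibility_value(.30, .05), 2))  # 0.24
# Mechanical comprehension, maintenance: .33 - 1.28 * .17
print(round(credibility_value(.33, .17), 2))  # 0.11
```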
In each distribution, only 10% of the validities are smaller than the 90% credibility value. The mean of the prior distribution is the mean validity coefficient corrected for range restriction and criterion unreliability (using mean values of both) but not for test unreliability. Thus these prior distributions assume mean test reliabilities of either .75 or .76 (see Table 2). The standard deviation of the prior distribution is the residual standard deviation corrected as described in Schmidt, Gast-Rosenberg, and Hunter (1980). The residual standard deviation is the square root of the variance of observed validities remaining after variance due to artifacts has been subtracted.

In the case of the mechanical comprehension and chemical comprehension tests for both job groups, the mean validities and the credibility values are substantial, thus demonstrating that validities are generalizable across settings and organizations. Both tests can confidently be used to select applicants for these two jobs. The demonstration that validity is generalizable for both job groups in the case of the mechanical comprehension and chemical comprehension tests is rendered more robust by the fact that positions included in each job group category were not identical in all tasks performed. The results obtained permit validity to be generalized to the same class of jobs from which the results were derived. Since this class is broader than it would have been had only jobs with identical task makeup been included in each job category, the results can be generalized to a larger number of jobs. Until recently it was believed that inclusion of a wider range of jobs in a validity generalization analysis would significantly reduce the chances of obtaining generalizable results. Recent research results, however, based on unusually large samples, have shown that the impact of differences between jobs in task makeup on the validity coefficients of cognitive tests is trivial in magnitude (Schmidt, Hunter, & Pearlman, 1981). Using the data from the present
study, along with additional data, Callender and Osburn (1981) have obtained similar findings. They have shown that combining validity data across the two job groups examined in this study, plus one other job group (laboratory), does not lead to enlarged estimates of SDρ, as would be expected if job groups in fact moderated test validities.

Consider now the results for the RBH background survey used within the maintenance job group. The reader will recall from our discussion of Table 4 that 75% of the variance of the validity coefficients in this distribution was accounted for by statistical artifacts, thus allowing rejection of the situational specificity hypothesis. The mean of this distribution, however, is very near zero; in fact, it is negative. These findings represent a case in which it is the lack of validity that can be generalized. Validity generalization is justified; we can predict that in a new setting this instrument will have at best a negligible validity for predicting job performance in the maintenance job group. Our confidence, however, must be tempered by the fact that this cell contains only eight validity coefficients and is therefore more susceptible to sampling error than are the other job-test combinations.

The evidence for the generalizability of validity is not as strong for the remaining test-job combinations, but there is substantial evidence of generalizability. In three of these four cases, over 90% of the values in the estimated true-validity distribution lie in the positive range, allowing the conclusion that positive validity can be expected in new settings and organizations more than 9 times out of 10. According to the decision rule adopted by Callender and Osburn (1981), such a conclusion provides a sufficient basis for validity generalization.

There are substantial differences in the results for arithmetic reasoning tests for the two job groups. Estimated true validity is higher for the operator than for the maintenance jobs (.26 vs. .15). Because of the relatively large standard deviation of true validities for operators, however, the true validity of .26 is associated with a relatively small 90% credibility value (.01). The abnormally low mean validity for the maintenance group is primarily due to a single
aberrant value of -.41 in the distribution of observed validities. The data necessary to check the numerical accuracy of this coefficient could not be obtained. The chance occurrence of the value of -.41 is less than 1 out of 5,000 in the absence of computational errors. Had this highly suspicious coefficient not been included, the estimated true validity of arithmetic reasoning tests for the maintenance job would have been approximately .21 (vs. .26 for operators).

A word about the use of Bayesian prior distributions in validity generalization is appropriate. The best estimate of true validity in a new setting involving the same job and a measure of the same construct is not the 90% credibility value but the mean of the corrected prior distribution. In this study, these means were computed assuming test reliabilities of either .75 or .76. If, in a given situation, the specific test used has higher reliability, both the mean and the credibility values would be correspondingly higher. The procedure for using these prior distributions in validity generalization applications is as follows. The mean and standard deviation of the prior distribution are corrected for test unreliability as well as for criterion unreliability and range restriction. Then, after the reliability of the specific test one is considering using has been determined, the mean and the standard deviation of the fully corrected prior distribution are attenuated by the square root of this reliability. This procedure allows the corrected prior distribution to be tailored to the specific test at hand, producing more accurate estimates of both expected true validity and the credibility values.

The logic of Bayesian statistical models is based on the provision for continual modification of prior distributions as additional information becomes available. Our Bayesian model of validity generalization is no exception in this regard. Where validity generalization prior distributions are based on large numbers of coefficients (e.g., 50 or more), it is very likely that additional validity information would have only a trivial effect on the prior distribution. Because the prior distributions in this study, however, are based on fewer coefficients than was typically the case in our previous studies, additional validity coefficients should be incorporated into the prior distributions as they become available. Such additional data are now being collected.

Determining Battery Validities

The results presented above apply to individual tests. In practice, several tests are frequently used in the selection of employees. Estimation of battery validities requires estimates of true validities and estimates of predictor intercorrelations in the applicant pool. Estimates of true validities for the predictors examined in this research can be found in Table 6; they are the means of the Bayesian prior distributions. Correlations among three of the four predictors in Table 6 were computed on two samples of applicants. (The two jobs share a common applicant pool.) Correlations for general intelligence tests could not be obtained. Both samples were large (981 and 922), and as expected, the correlation matrices were very similar. The two matrices were averaged to give optimal estimates of predictor intercorrelations. The resulting matrix, along with validities from Table 6, is shown in Table 7.

The validities of various possible batteries were calculated using both unit and beta weights. The results are shown in Table 8. All batteries for both jobs have validities large enough to have substantial practical value (Schmidt, Hunter, McKenzie, & Muldrow, 1979) regardless of the method of weighting predictors. For the operator job group, there is virtually no difference in battery validities produced by unit and beta weights, as would be expected based on previous research (Schmidt, 1971). When rounded off to two places, validities are the same for unit and beta weights for all batteries except one, and in that case the difference is only .01. Further, there is little difference in validity between different batteries. The battery composed of mechanical comprehension and chemical comprehension—the two test types for which support for validity generalization is greatest—shows a level of validity equal to or greater than any other battery.

For the maintenance job group, beta weights appear to produce higher validities than unit weights for all four batteries.
Table 7
Observed Test Correlations and Validitiesa

Test                               1      2      3      4      5
1. Mechanical comprehension      1.00    .66    .57    .34    .34
2. Chemical comprehension         .66   1.00    .60    .30    .25
3. Arithmetic reasoning           .57    .60   1.00    .26    .15
4. Job performance: operator      .34    .30    .26   1.00     —
5. Job performance: maintenance   .34    .25    .15     —    1.00

Note. — = not reported. a Validities are taken from Table 6.
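Given Table 7, the battery validities can be reproduced with standard regression algebra. The sketch below (ours) computes the beta-weighted and unit-weighted validity of the three-test operator battery; Table 8 additionally applies Browne's (1975) shrinkage formula, which is why its entries run slightly lower than the raw figures here.

```python
import numpy as np

# Predictor intercorrelations (Table 7): mechanical comprehension,
# chemical comprehension, arithmetic reasoning.
R = np.array([[1.00, .66, .57],
              [.66, 1.00, .60],
              [.57, .60, 1.00]])
v = np.array([.34, .30, .26])  # operator validities (Table 7)

beta = np.linalg.solve(R, v)           # ~[.233, .109, .062]; cf. Table 10
r_beta = np.sqrt(v @ beta)             # ~.358; shrunken to .35 in Table 8
w = np.ones(3)                         # unit weights
r_unit = (w @ v) / np.sqrt(w @ R @ w)  # ~.349, i.e., .35 in Table 8
print(beta.round(3), round(r_beta, 3), round(r_unit, 3))
```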
In many organizations, beta weights are less likely to be used because of the inconvenience of applying them during selection. If we consider only unit weights, Table 8 shows that batteries made up of mechanical comprehension and chemical comprehension tests can be expected to have validities at least as high as other possible batteries. Further, the validity of the unit-weighted battery is only .02 less than that produced by beta weights. Again, these are the two test types for which validity generalization support is the strongest.

Estimating the Relative Contributions of Different Construct True Scores to Job Success

In planning future research on selection and modifications of existing selection instruments, it is useful to examine predictor-criterion relationships at the true-score level. This can be done by examination of beta weights corrected for predictor unreliability. Beta weights as typically computed often do not tell the whole story.
For example, suppose that only Ability A actually is used in performance on a given job. Suppose further that the researcher also measures Ability B and that the two abilities are substantially positively correlated (as is usually the case). Assume that both abilities are measured with a reliability of .75. Because Ability B is correlated with Ability A, Ability B will show a positive validity. Moreover, Ability B will probably receive a positive beta weight even though it does not contribute directly to job performance. Because Ability A is not measured with perfect reliability, partialing the measure of Ability A out of Ability B does not partial all of the effect of Ability A out of Ability B. Therefore the measure of Ability B receives a positive beta weight because even after partialing it contributes to the measurement of Ability A. Thus beta weights computed using observed correlations among predictors typically do not provide an accurate picture of events at the true-score level. That is, they do not provide an accurate picture of the relative importance of the abilities themselves, as opposed to imperfect measures of the abilities (test scores).
Table 8
Validities of Four Possible Test Batteries Against Job Proficiency Using Unit and Beta Weights

                                                          Operator        Maintenance
Battery                                                 Betaa   Unit     Betaa   Unit
Mechanical comprehension, chemical comprehension,
  & arithmetic reasoning                                 .35     .35      .35     .29
Mechanical comprehension & chemical comprehension        .35     .35      .34     .32
Mechanical comprehension & arithmetic reasoning          .35     .34      .34     .28
Chemical comprehension & arithmetic reasoning            .32     .32      .25     .22

a All validities shrunken using Browne's (1975) formula. See also Cattin (1980).
Table 9
Estimated True-Score Correlations Among Tests and Job Proficiency Measures

Measure                            1      2      3      4      5
1. Mechanical comprehension      1.00    .88    .76    .39    .39
2. Chemical comprehension         .88   1.00    .80    .35    .29
3. Arithmetic reasoning           .76    .80   1.00    .30    .17
4. Job performance: operator      .39    .35    .30   1.00     —
5. Job performance: maintenance   .39    .29    .17     —    1.00

Note. — = not reported.
Estimates of actual true-score contributions of different abilities can be obtained by first correcting all correlations for predictor unreliabilities and then computing beta weights. The correlations from Table 7 corrected for predictor unreliability are shown in Table 9. Beta weights for different test batteries computed using both observed and corrected correlations are shown in Table 10.

The results in the case of the operator job group are very clear cut. For all batteries containing mechanical comprehension, the true-score beta weight for this ability is larger than the obtained-score beta weight, and true-score beta weights for other abilities are smaller than obtained-score beta weights. This means that the effect of unreliability in measures of these abilities is one of underestimating the actual importance of mechanical ability in this job and overestimating the importance of other abilities. At the level of actual abilities, when mechanical ability is held constant, the other two abilities make only small contributions to job performance. The importance of their contribution is overestimated by obtained-score beta weights. These results indicate that it would be most fruitful to concentrate on increasing the reliability of measures of mechanical comprehension. The use of more reliable measures of mechanical comprehension would reduce the relative importance of chemical comprehension, although the latter might continue to make a useful contribution as long as the reliability of the measure of mechanical comprehension is less than perfect.
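The correction itself is mechanical. The sketch below is our illustration under the paper's assumed predictor reliabilities of .75 (Table 2): predictor intercorrelations are divided by the product of the square roots of the two reliabilities, validities by the square root of the predictor's reliability alone, and the regression weights are then recomputed. Because the published weights were computed from unrounded inputs, the output approximates rather than exactly reproduces the true-score column of Table 10.

```python
import numpy as np

R = np.array([[1.00, .66, .57],   # observed intercorrelations (Table 7)
              [.66, 1.00, .60],
              [.57, .60, 1.00]])
v = np.array([.34, .30, .26])     # operator validities (Table 7)
rel = np.array([.75, .75, .75])   # assumed predictor reliabilities (Table 2)

# Disattenuate: predictor-predictor correlations for unreliability of both
# measures; validities for unreliability of the predictor only.
R_true = R / np.sqrt(np.outer(rel, rel))
np.fill_diagonal(R_true, 1.0)     # a true score correlates 1.0 with itself
v_true = v / np.sqrt(rel)

print(np.linalg.solve(R, v).round(3))  # obtained-score betas: .233 .109 .062
print(np.linalg.solve(R_true, v_true).round(3))
# True-score betas: ~.388 .002 .004 (Table 10 reports .372, .000, .023);
# either way, mechanical comprehension dominates at the true-score level.
```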
Table 10
Obtained and True-Score Beta Weights for Four Test Batteries

                                    Operator                Maintenance
                              Obtained     True        Obtained     True
Battery                       score        score       score        score
Battery 1
  Mechanical comprehension     .233         .372        .338         .703
  Chemical comprehension       .110         .000        .082        -.101
  Arithmetic reasoning         .062         .023       -.092        -.289
Battery 2
  Mechanical comprehension     .252         .378        .310         .634
  Chemical comprehension       .134         .014        .045        -.275
Battery 3
  Mechanical comprehension     .284         .372        .377         .637
  Arithmetic reasoning         .098         .024       -.065        -.320
Battery 4
  Chemical comprehension       .225         .288        .250         .443
  Arithmetic reasoning         .125         .077        .000        -.189
The actual pattern of results in the case of the maintenance job group is similar, but interpretation is complicated by the presence of negative beta weights. Again, for all batteries containing mechanical comprehension, the true-score beta weight for this ability is larger than the obtained-score beta weight, indicating that the effect of unreliability in the ability measures is one of underestimating the importance of mechanical ability. But for this job group, the absolute values of beta weights for most other abilities increase as we move from obtained-score to true-score betas. This increase, however, is deceptive. These large negative weights suggest that the other abilities might act as suppressor variables; they do not mean that, holding mechanical ability constant, the other abilities make a direct contribution to the criterion. The contribution of a suppressor variable to prediction is an indirect one, resulting from its ability to "suppress" or partial out nonvalid variance in a valid predictor. Thus, for the maintenance job group as for the operator job group, the contribution to job performance of abilities other than mechanical comprehension is smaller in reality (i.e., at the true-score level) than obtained-score beta weights would suggest. For the maintenance job group, even more so than for the operator job group, the most critical ability appears to be mechanical comprehension. As in the case of the operator job group, these results indicate that it would be fruitful to concentrate on increasing the reliability of currently used measures of mechanical comprehension. (These results also suggest that if highly reliable measures of all three abilities could be developed, chemical comprehension and arithmetic reasoning could possibly be used as suppressor variables. In the past, however, most employers have rejected the use of suppressor variables for a number of reasons, including the possibility that applicants could learn how scores were used and then deliberately get low scores on the test.)

Reference Note

1. Schmidt, F. L., Hunter, J. E., & Caplan, J. R. Validity generalization: Results for two occupations in the petroleum industry. Unpublished manuscript, American Petroleum Institute, Washington, D.C., 1979.
References

Brogden, H. E., & Taylor, E. K. A theory and classification of criterion bias. Educational and Psychological Measurement, 1950, 10, 159-186.
Browne, M. W. Predictive validity of a linear regression equation. British Journal of Mathematical and Statistical Psychology, 1975, 28, 79-87.
Callender, J. C., & Osburn, H. G. Development and test of a new model for validity generalization. Journal of Applied Psychology, 1980, 65, 543-558.
Callender, J. C., & Osburn, H. G. Testing the constancy of validity with computer-generated sampling distributions of the multiplicative model variance estimate: Results for petroleum industry validation research. Journal of Applied Psychology, 1981, 66, 274-281.
Cattin, P. The estimation of the predictive power of a regression model. Journal of Applied Psychology, 1980, 65, 407-414.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 1980, 65, 373-406.
Schmidt, F. L. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 1971, 31, 699-714.
Schmidt, F. L., Gast-Rosenberg, I., & Hunter, J. E. Validity generalization results for computer programmers. Journal of Applied Psychology, 1980, 65, 643-661.
Schmidt, F. L., & Hunter, J. E. Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 1977, 62, 529-540.
Schmidt, F. L., Hunter, J. E., McKenzie, R., & Muldrow, T. The impact of valid selection procedures on workforce productivity. Journal of Applied Psychology, 1979, 64, 609-626.
Schmidt, F. L., Hunter, J. E., & Pearlman, K. Task differences as moderators of aptitude test validity in selection: A red herring. Journal of Applied Psychology, 1981, 66, 166-185.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. Further tests of the Schmidt-Hunter Bayesian validity generalization procedure. Personnel Psychology, 1979, 32, 257-281.
Schmidt, F. L., Hunter, J. E., & Urry, V. Statistical power in criterion-related validation studies. Journal of Applied Psychology, 1976, 61, 473-485.
U.S. Department of Labor. Dictionary of occupational titles (4th ed.). Washington, D.C.: U.S. Government Printing Office, 1977.
Wolins, L. Responsibility for raw data. American Psychologist, 1962, 17, 657-658.
Appendix
Job Codes From the Dictionary of Occupational Titles (U.S. Department of Labor, 1977)

Operator job group: 029.261-022, 540.462-010, 546.382-010, 549.132-030, 549.260-010, 542.362-014, 542.562-010, 549.360-010, 549.362-010, 549.362-014, 549.684-010, 549.687-018, 559.382-018, 891.687-022, 910.384-010, 914.167-014, 914.382-010, 914.384-010, 914.667-010

Maintenance job group: 600.280-014, 600.280-022, 620.281-046, 710.281-026, 804.281-010, 805.261-014, 811.684-014, 812.682-010, 829.281-014, 860.281-010, 862.381-018, 863.381-014, 921.260-010
Received July 17, 1980