In the public domain

Journal of Applied Psychology 1981, Vol. 66, No. 2, 166-185

Task Differences as Moderators of Aptitude Test Validity in Selection: A Red Herring

Frank L. Schmidt
U.S. Office of Personnel Management, Washington, D.C., and George Washington University

John E. Hunter
Michigan State University

Kenneth Pearlman
U.S. Office of Personnel Management, Washington, D.C.

This article describes results of two studies, based on a total sample size of nearly 400,000, examining the traditional belief that between-job task differences cause aptitude tests to be valid for some jobs but not for others. Results indicate that aptitude tests are valid across jobs. The moderating effect of tasks is negligible even when jobs differ grossly in task makeup and is probably nonexistent when task differences are less extreme. These results have important implications for validity generalization, for the use of task-oriented job analysis in selection research, for criterion construction, for moderator research, and for proper interpretation of the Uniform Guidelines on Employee Selection Procedures. The philosophy of science and methodological assumptions historically underlying belief in the hypothesis that tasks are important moderators of test validities are examined and critiqued. It is concluded that the belief in this hypothesis can be traced to behaviorist assumptions introduced into personnel psychology in the early 1960s and that, in retrospect, these assumptions can be seen to be false.

The opinions expressed herein are those of the authors and do not necessarily reflect the official policy of the U.S. Office of Personnel Management. Requests for reprints should be sent to Frank L. Schmidt, Personnel Research and Development Center, U.S. Office of Personnel Management, 1900 E Street, N.W., Washington, D.C. 20415.

Jobs differ in the tasks that make them up. Within a broad occupational area like clerical work, jobs can be classified into relatively task-homogeneous job families. Many industrial psychologists believe that task-based differentiations of this sort are important because they believe that such task differences moderate aptitude test validities. Further, they believe that this moderating effect is substantial. That is, they hold that the effect will often be expressed not in differences of a few correlation points but rather will be large enough that an aptitude test with substantial validity for one grouping of tasks (job) may be "invalid" (i.e., have a near-zero validity) for another grouping of tasks (job). This is the sense in which we use the term moderator effect in this article. As a result of this belief, there has been in

recent years a heavy emphasis in selection psychology on the use of job analysis methods that focus on job tasks, duties, or specific behaviors (Prien & Ronan, 1971). This trend has not only been reflected in the published literature but has also influenced the content of the Uniform Guidelines on Employee Selection Procedures (U.S. Equal Employment Opportunity Commission, 1978). The assumption that job tasks and behaviors moderate the validity of employment tests appears plausible on the surface, and, perhaps because of this very plausibility, little need has been felt for empirical evidence to support it. We could find no empirical studies focusing on the validity of this proposition. This article presents the results of two very large-sample studies showing that tasks have little or no moderating effect on the validities of measures of traditional ability constructs. Since these results contradict beliefs that are widespread among personnel psychologists, they require some explanation. We therefore examine the intellectual history of personnel psychology and present


an interpretation in which epistemological and philosophy of science influences from behaviorism and logical positivism are shown to have led to assumptions, interpretations, and conclusions that in retrospect can be seen to be false.

Study 1

As part of our research program examining the generalizability (transportability) of validity (Pearlman, Schmidt, & Hunter, 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, & Shane, 1979), we have developed a validity data bank for clerical occupations that contains nearly 3,400 validity coefficients for a wide variety of predictors against measures of overall job proficiency and overall training success. Tests were classified using a system adapted from Ghiselli (1966, pp. 15-21) and Dunnette (Note 1). Ten general categories of test types were established, most of which represent a construct or ability factor found in the psychometric literature (e.g., verbal ability, quantitative ability, perceptual speed). Categories for general mental ability tests (consisting of verbal, quantitative, and abstract reasoning or spatial ability components) and so-called clerical aptitude tests (consisting of verbal, quantitative, and perceptual speed components) were included because of their relatively common use in clerical selection, even though they can be decomposed (e.g., using factor analysis) into more homogeneous constituent dimensions. Two of the 10 test type categories—motor ability tests and performance tests—were excluded from the present analysis because they are considerably more factorially complex and cannot be considered as constructs in the same sense as the other 8 test types. Clerical jobs were classified based on a modification of the Dictionary of Occupational Titles (DOT) classification system (Pearlman, 1979; U.S. Department of Labor, 1977). This system groups clerical jobs into five true job family categories, one miscellaneous category, and two residual categories (developed to handle clerical occupations that were not sufficiently specified in the original study to permit definitive


classification, and samples representing two or more different job families). Only the five true (i.e., relatively task-homogeneous) job families are considered in this study. These five job families are shown in Table 1. Both the test and job classification systems are fully described elsewhere (Pearlman, 1979; Pearlman et al., 1980). We conducted an extensive search for both published and unpublished validity studies of clerical jobs. This search produced 3,368 validity coefficients representing 698 independent samples, approximately two thirds of which came from unpublished studies. Of these 3,368 coefficients, 2,786 are based on job proficiency or performance criteria and 582 are based on criteria of training success. The data for each criterion type were treated separately in the present study. If task differences between jobs moderate test validities, we would expect any given test type to show substantial validity differences across the five task-homogeneous job families. Similarly, we would expect the mean within-family standard deviation (SD) of validity coefficients to be substantially smaller than the SD of validity coefficients pooled across all job families. If neither of these effects is observed, then the evidence is strong that between-family task differences have little or no effect on test validities.

Method

The mean observed (uncorrected) validity coefficient (r̄) was computed for each test type-job family combination for which there were 10 or more validity coefficients. The mean observed validity coefficient for each test type pooled across all job families was similarly computed. We also computed estimates of true validity (ρ̂) corresponding to both types of mean uncorrected validities. These are mean observed validities corrected for average assumed values of range restriction and criterion unreliability. (Average assumed range restriction was to an SD of 5.945 from an unrestricted SD of 10.0; average assumed criterion reliability was .60 for validities based on proficiency criteria and .80 for those based on training criteria. See Pearlman et al., 1980, for a full explanation of the use of these values.) For each test type-job family combination for which there were more than 6 coefficients, the SD of observed validity coefficients was computed. (The lower minimum value reflects the fact that the standard error is smaller for the SD than for the mean, sample size being equal.) Likewise, the SD was computed for validities for each test type pooled across job families.
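To make the two corrections concrete, the short sketch below (Python) applies a standard range-restriction correction followed by a correction for criterion unreliability to a mean observed validity. The function names and the worked example are ours, and the exact computational sequence in Pearlman et al. (1980) may differ in detail; the figures used (r̄ = .24, restricted SD = 5.945, unrestricted SD = 10.0, criterion reliability = .60) are the averages quoted above.

```python
import math

def correct_for_range_restriction(r, restricted_sd, unrestricted_sd):
    """Standard direct range-restriction correction: estimate the validity in the
    unrestricted applicant population from the restricted (incumbent) r."""
    u = unrestricted_sd / restricted_sd          # e.g., 10.0 / 5.945
    return (u * r) / math.sqrt(1.0 + r * r * (u * u - 1.0))

def correct_for_criterion_unreliability(r, criterion_reliability):
    """Disattenuate the validity for unreliability in the criterion only."""
    return r / math.sqrt(criterion_reliability)

# Illustrative values for proficiency criteria in Study 1.
r_bar = 0.24
r_rr = correct_for_range_restriction(r_bar, restricted_sd=5.945, unrestricted_sd=10.0)
rho_hat = correct_for_criterion_unreliability(r_rr, criterion_reliability=0.60)
print(round(r_rr, 2), round(rho_hat, 2))   # approximately 0.38 and 0.50
```

With these inputs the sketch returns an estimated true validity of about .50, which matches the general mental ability entry for proficiency criteria in Table 2.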

Results and Discussion


Results for proficiency data are shown in Tables 2 and 3. Table 2 shows the mean observed validity and the corresponding estimate of true validity for the pooled data and separately for each job family (as defined in Table 1). Table 2 also shows the number of validities on which each mean is based. These data clearly contradict the hypothesis that task, duty, or behavior differences between jobs moderate test validities. For each construct, mean observed validities and estimated true validities are very similar across the different job families. Tests valid for one job family are valid for all other job families. Mean observed validities for the various job families differ only trivially from the coefficient computed on the pooled data. The tiny differences in Table 2 would be expected on the basis of sampling error alone.

The results shown in Table 3 are a different way of looking at these same data. For each test type, Table 3 shows the SD of validities pooled across job families, the mean within-family SD, and the difference between these two values. Also shown is the number of validities on which both types of SD are based. If the hypothesis that task differences affect validities is correct, then for each test type, the mean within-family SD of validity coefficients should be substantially smaller than the SD of pooled validity coefficients for that test type. The data in Table 3 do not support this prediction. The average increase in the SD of validity coefficients resulting from pooling across job families is a miniscule .013 correlation points. A difference of this size is expected on the basis of sampling error alone.

The results for validities against measures of overall success in training are quite similar. Table 4 shows that for each construct, mean observed validities and estimated true validities are very similar across different job families. With the possible exception of perceptual speed tests for job family E (based on only 10 coefficients), tests valid for one job family are valid for all other job families. The pattern of findings in Table 5 is strikingly similar to results in Table 3. For training success criteria, as well as proficiency criteria, the mean within-family SD of validity coefficients is very close to the SD computed across the total group of coefficients.

Table 1
Composition of Five Clerical Job Families in Study 1

A. Stenography, typing, filing, & related occupations (20)
   Secretaries
   Stenographers
   Typists & typewriting machine operators
   Interviewing clerks
   File clerks
   Duplicating-machine operators & tenders
   Mailing & miscellaneous office machine operators
   Stenography, typing, filing, & related occupations NEC

B. Computing & account-recording occupations (21)
   Bookkeepers & bookkeeping-machine operators
   Cashiers & tellers
   Electronic & electromechanical data processors
   Billing & rate clerks
   Payroll, timekeeping, & duty-roster clerks
   Account-recording machine operators NEC
   Computing & account-recording occupations NEC

C. Production & stock clerks and related occupations (22)
   Production clerks
   Shipping, receiving, stock, & related clerical occupations
   Production & stock clerks & related occupations NEC

D. Information & message distribution occupations (23)
   Hand delivery & distribution occupations
   Telephone operators
   Telegraph operators
   Information & reception clerks
   Accommodation clerks & gate & ticket agents
   Information & message distribution occupations NEC

E. Public contact & clerical service occupations (241-248)
   Investigators, adjusters, & related occupations
   Government service clerks NEC
   Medical service clerks NEC
   Advertising service clerks NEC
   Transportation service clerks NEC

Note. Figures in parentheses are Dictionary of Occupational Titles (Pearlman, 1979; U.S. Department of Labor, 1977) occupational division/group numbers. NEC = not elsewhere classified.

Table 2
Comparison of Overall Mean Validities and Job Family Means for Eight Test Types Used in Clerical Selection (Proficiency Criteria)

Pooled validity coefficients:

Test type                  n        r̄     ρ̂    No. rs
General mental ability    10,564   .24   .50    144
Verbal ability            27,352   .19   .40    355
Quantitative ability      25,850   .24   .50    333
Reasoning ability          5,377   .21   .44     80
Perceptual speed          51,361   .22   .47    702
Memory                     6,278   .19   .39    102
Spatial/mechanical         9,240   .14   .30    107
Clerical aptitude          5,989   .25   .51     94
Totals                   142,011               1,917

[Columns giving r̄, ρ̂, and No. rs separately for the individual job families A-E are not reproduced here.]

Note. Job families based on Dictionary of Occupational Titles (Pearlman, 1979; U.S. Department of Labor, 1977) occupational groupings: A = Division 20, B = Division 21, C = Division 22, D = Division 23, and E = Groups 241-248. — = not reported.
a Cells with less than 10 validities (means were computed only for cells with 10 or more validities).



Table 3
Standard Deviations of Observed Validity Coefficients Across and Within Job Families of Clerical Work (Proficiency Criteria)

                           Across families           Within families
Test type                  SD      No. validities    Mean SD(a)   No. validities   Difference
General mental ability     .166       144            .136(b)         140              .030
Verbal ability             .161       355            .133            355              .028
Quantitative ability       .142       333            .130            333              .012
Reasoning ability          .147        80            .145(c)          75              .002
Perceptual speed           .161       702            .148            702              .013
Memory                     .144       102            .140(c)          99              .004
Spatial/mechanical         .145       107            .129(b)         103              .016
Clerical aptitude          .174        94            .179(d)          89             -.005
Ms and totals              .155     1,917            .143          1,896              .013

a Based on five Dictionary of Occupational Titles (DOT; Pearlman, 1979; U.S. Department of Labor, 1977) job families unless otherwise specified; SDs included only if the cell had six or more validities. b Based on four DOT job families. c Based on three DOT job families. d Based on two DOT job families.
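The comparison reported in Tables 3 and 5 reduces to a few lines of code. In the sketch below (Python), validity coefficients are grouped by job family; the lists themselves are invented for illustration, and only the computation mirrors the analysis described in the text.

```python
from statistics import pstdev

def sd_comparison(validities_by_family):
    """Compare the SD of validities pooled across job families with the
    mean of the within-family SDs (the quantities reported in Tables 3 and 5)."""
    pooled = [r for family in validities_by_family.values() for r in family]
    pooled_sd = pstdev(pooled)
    within_sds = [pstdev(family) for family in validities_by_family.values()
                  if len(family) >= 6]          # SDs only for cells with 6+ validities
    mean_within_sd = sum(within_sds) / len(within_sds)
    return pooled_sd, mean_within_sd, pooled_sd - mean_within_sd

# Hypothetical validity coefficients for three job families (illustration only).
example = {
    "A": [.18, .25, .31, .22, .27, .20, .35, .15],
    "B": [.24, .19, .28, .33, .21, .26, .17],
    "C": [.22, .30, .26, .18, .29, .24],
}
print(sd_comparison(example))
```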

The average increase in the SD resulting from pooling validities across job families is a trivial .015. Thus, the results for training criteria also contradict the hypothesis that task, duty, or behavior differences between jobs moderate test validities. If large differences in duties and behaviors of the kind that distinguish these five job families from each other have little if any effect on validity coefficients, it is highly unlikely that fine-grained differences in tasks within job families will have such an effect. Indeed, this is precisely the meaning of the validity generalization results reported in Pearlman et al. (1980), based on the same data set as used in this study. That study found that most of the variance of validity coefficients within these job families was accounted for by only four statistical artifacts. This finding indicates that there are no moderator effects within job families large enough to be of any practical or theoretical significance. Thus, there appears to be no plausible way the data in Study 1 can be interpreted so as to be consistent with the belief that task differences moderate validities.

Study 2

Having shown that task differences between occupational areas in clerical work do not affect test validities, we asked ourselves

whether the task differences existing between entirely different kinds of jobs would be potent enough to produce nontrivial moderating effects. Fortunately, we had available large-sample data from military validity studies that were ideal for answering this question. These data are from research on the Army Classification Battery (ACB; Helme, Gibson, & Brogden, Note 2), which consists of 10 widely varying subtests, including achievement or information tests as well as aptitude tests. These data consisted of two entirely independent samples of approximately 10,500 individuals each. Each of these samples contained validity coefficients for each of the ACB tests for each of 35 jobs (listed in Table 6). The jobs were the same in both samples, and the average n per job was approximately 300 in both samples. Included were such jobs as welders, cooks, clerks, and personnel administrators. Clearly, such jobs vary widely in their task makeup. All validity coefficients were computed against measures of success in training and (in the original study) all were corrected for range restriction but not for criterion unreliability. These validity coefficients are given in Appendix A of Schmidt and Hunter (1978). The question to be examined was not whether validities for a given test vary across jobs. A previous study using these same data

Table 4
Comparison of Overall Mean Validities and Job Family Means for Seven Test Types Used in Clerical Selection (Training Criteria)

Pooled validity coefficients:

Test type                  N        r̄     ρ̂    No. rs
General mental ability    31,535   .43   .70     61
Verbal ability            44,142   .39   .64     97
Quantitative ability      50,415   .43   .70    102
Reasoning ability          4,928   .22   .39     25
Perceptual speed          38,365   .22   .39    153
Spatial/mechanical        41,942   .21   .37     63
Clerical aptitude         15,539   .38   .64     34
Totals                   226,866                535

[Columns giving r̄, ρ̂, and No. rs separately for the individual job families A-E are not reproduced here.]

Note. Job families based on Dictionary of Occupational Titles (Pearlman, 1979; U.S. Department of Labor, 1977) occupational groupings: A = Division 20, B = Division 21, C = Division 22, D = Division 23, and E = Groups 241-248. — = not reported.
a Cells with less than 10 validities (means were computed only for cells with 10 or more validities).



Table 5
Standard Deviations of Observed Validity Coefficients Across and Within Job Families of Clerical Work (Training Criteria)

                           Across families           Within families
Test type                  SD      No. validities    Mean SD(a)   No. validities   Difference
General mental ability     .116        61            .112             61              .004
Verbal ability             .119        97            .116             97              .003
Quantitative ability       .118       102            .103            102              .015
Reasoning ability          .131        25            .091(b)          25              .040
Perceptual speed           .146       153            .140            153              .006
Spatial/mechanical         .128        63            .111(b)          56              .017
Clerical aptitude          .107        34            .085(c)          24              .022
Ms and totals              .124       535            .108            518              .015

a Based on five Dictionary of Occupational Titles (DOT; Pearlman, 1979; U.S. Department of Labor, 1977) job families unless otherwise specified; SDs included only if the cell had six or more validities. b Based on three DOT job families. c Based on two DOT job families.

(Schmidt & Hunter, 1978) had already shown that there were reliable differences between jobs in validities for all but perhaps one of the ACB tests, Radiocode Aptitude. That study, however, did not determine the magnitude of these effects. The present study determined these magnitudes to ascertain whether these effects were large enough to satisfy the definition of moderator effect used in this article; that is, are tests valid for some jobs but invalid for others. Method For each test, the 35 validity coefficients from Sample 1 were correlated with the 35 validity coefficients from Sample 2. The resulting correlation estimates the reliability of the validity coefficients. That is, for each test, this coefficient estimates the proportion of the betweenjob variance in validity coefficients that is true variance. The product of this coefficient and the observed between-job variance therefore provides an estimate of true between-job variance in validity coefficients for each test. The square root of this value is the estimate of the true SD. This procedure provides a means of partialing the effects of sampling error out of estimates of between-job validity differences.

Results and Discussion

Results obtained in each of the two samples are shown in Table 7. The left-hand column shows the reliability of validity coefficients, that is, the correlation between validity estimates in Samples 1 and 2. These are substantial for all tests except Radiocode

Aptitude, thus indicating that differences between jobs in the validity of a given test do exist. The average value for the true SD is .1081 in Sample 1 and .1202 in Sample 2, for an average of .1142. This is the SD of validity coefficients across jobs, where those coefficients have been corrected for sampling error and range restriction. These values are remarkably small considering the range of jobs in the sample. This finding is unchanged if we distinguish between aptitude and achievement (i.e., information or knowledge) tests. The first six tests in Table 7 are aptitude tests; the remaining four measure specific information. Our primary interest is in aptitude tests, that is, measures of constructs. We would expect between-job differences in validity to be greater for tests of specific information. For the six aptitude tests in Table 7, the average value for the true SD is .1005. For our purposes, however, a more informative figure might be obtained by omitting the Radiocode Aptitude Test, which is very specific in focus and is not representative of widely used aptitude tests. With the Radiocode Aptitude Test omitted, the mean true SD is .1123, essentially identical to the overall mean of .1142. For the four information tests, the mean value for the true SD is .1346, about .02 higher than the mean value for the aptitude tests. The differences between jobs in validity


for a given aptitude test are so small that very large sample sizes are required to detect them reliably. For example, using a one-tailed test, the sample size required for each job for statistical power of .90 is 1,420 if the two jobs differ by one SD, that is, by .11. For a two-tailed test, this figure is 1,736. If the two jobs differ by two SDs, or .22, the required sample sizes are 355 in each job for a one-tailed test and 435 in each job for a two-tailed test. Differences as large as two SDs will occur less than one time out of seven, but even these smaller sample sizes are rarely available. Thus researchers interested in these tiny differences would rarely be able to detect them reliably in individual studies. An average value for the true SD of .11 for conventional aptitude tests means that, for the typical test, 68% of the (range-restriction-corrected) validities for different jobs lie within ±.11 of the mean. Ninety percent lie within ±.18 of the mean. The average validity for all tests across the two samples is .45. Thus, for the average test, the validity lies between .34 and .56 for 68% of the jobs and between .27 and .63 for 90% of the jobs. Confidence intervals for specific tests differ slightly from the average figures. For example, the validity for the arithmetic reasoning test lies between .44 and .69 for 68% of the jobs and between .35 and .78 for 90% of the jobs. But in all cases, the range is small considering that the jobs range in use of arithmetic from cook to multichannel microwave repair.¹

In our work on validity generalization (Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, & Shane, 1979), we have employed a Bayesian model to determine when validity generalization is justified for a given test type within a single job type or task-homogeneous job family. If more than 75% of the variance of observed validity coefficients is accounted for by statistical artifacts, we conclude that validity generalization is justified by virtue of no variation in true validity. If less variance is accounted for, the decision rule produced by this model states that validity generalization is justified only if a substantial majority, say 90%, of validity estimates fall

Table 6
35 Army Jobs in Study 2

MOS     Job title                               Sample 1   Sample 2
271-2   Fixed Station Radio Repair                 310        323
281     Microwave-Multichannel Repair              216        236
282     Radar Repair                               242        237
296     Field Radio Repair                         280        481
320     Field Wireman                              330        330
403     Fire Control Instrument Repair             214        214
411     Ammunition Supply                          767        702
421     Artillery Mechanic-Light Weapons           196        205
422     Small Weapons Mechanic                     418        430
424     Turret Artillery Repair                    183        186
442     Welder                                     236        243
443     Machinist                                  296        314
452     Dental Laboratory                          100        121
631     Automotive Mechanic                        154        198
632     Track Vehicle Repair                       448        430
632     Armor Track Vehicle Maintenance            248        248
634     Fuel & Electrical Systems Repair           523        522
635     Track Vehicle Chassis Rebuild              103        114
710     Clerk                                      311        340
712     Stenography                                569        569
714     Postal Operations                          293        295
716     Personnel Administration                   286        286
716     Personnel Management (Enlisted)            556        556
717     Advanced Army Administration               406        401
753     Machine Accounting                         383        355
763     Ordnance Storage Specialist                291        274
911     Medical Airman, Advanced                   308        200
912     Medical Technician                         271        200
917     Dental Assistant                           367        200
941     Cook                                       305        338
951     Military Police (Enlisted Advisor)         159        363
952     Disciplinary Guard (Enlisted)              144        139
953     Criminal Investigation                     192        106
051     Radio Operator (Intermediate Speed)        150        145
052     Radio Operator (High Speed)                233        233

Note. From Helme et al. (Note 2). MOS = Military Occupational Specialty.

¹ Obviously these statements are predicated on the assumption that the Pearson r is normally distributed even when the population value (ρ) is nonzero. Despite statements to the contrary in many texts (e.g., Downie & Heath, 1965, pp. 154-155), this assumption is true to a very close approximation with moderate and large sample sizes. Although the standard error of r does vary as a function of ρ, the shape of the sampling distribution is close to normal across almost the entire range of r (Kendall & Stuart, 1973, pp. 385-392; see especially Equation 16.74, p. 390). For example, when N = 300 and the population correlation is .50, the skewness of the sampling distribution of r is only -.17.


Table 7
Variability of Validities of Army Classification Battery (ACB) Subtests Across 35 Jobs

                            r between           Sample 1 (N = 10,488)                      Sample 2 (N = 10,534)
                            validities in
ACB subtest                 Samples 1 & 2   r̄(a)  Total σ²  True σ²  True SD(b)    r̄(a)  Total σ²  True σ²  True SD(b)
Vocabulary                      .71         .51    .01672    .01187     .1089       .52    .02387    .01695     .1302
Arithmetic Reasoning            .82         .56    .01783    .01462     .1209       .57    .02255    .01849     .1360
Spatial Aptitude                .68         .47    .01613    .01097     .1047       .49    .01338    .00910     .0954
Mechanical Aptitude             .76         .51    .01369    .01040     .1020       .50    .01565    .01189     .1091
Clerical Speed                  .71         .39    .01398    .00992     .0996       .42    .01912    .01358     .1165
Radiocode Aptitude              .19         .34    .00809    .00154     .0392       .35    .01003    .00191     .0437
Shop Mechanics                  .79         .48    .02353    .01859     .1363       .48    .02016    .01593     .1262
Automotive Information          .85         .41    .02670    .02270     .1507       .38    .03779    .03212     .1792
Electronics Information         .86         .45    .01271    .01093     .1046       .44    .02034    .01749     .1323
Radio Information               .84         .32    .01544    .01297     .1139       .32    .02124    .01784     .1336
M (true SDs)                                                             .1081                                    .1202

a Average validity coefficient for test across jobs; the correlation between these values for Samples 1 and 2 is .98. b The correlation between true SDs in Samples 1 and 2 is .95.

above some minimum useful level of validity. Interestingly, applying this same decision rule to validities in this study—validities from entirely different jobs—leads to the conclusion that validities are generalizable for all of the 10 tests. This generalization is to other jobs that are members of the highly task-heterogeneous population of jobs from which the 35 jobs in Study 2 are a sample. Table 8 lists the validity values above which 90% of all coefficients lie for each of the tests in the two samples. (The results are remarkably similar for the two independent samples, illustrating the high reproducibility and stability of large-sample estimates in personnel research. The two columns in Table 8 correlate .95.) In every case these validity values are high enough to be of substantial practical utility in most selection settings (Hunter & Schmidt, in press; Schmidt, Hunter, McKenzie, & Muldrow, 1979), and this is true despite the fact that these coefficients have not been corrected for criterion unreliability. Thus, the probability that between-job variation in aptitude test validity will result in a situation in which the test is valid for one job but not for another is quite low. (Further, even if such a situation occurred, there would rarely be adequate statistical power to detect it.) These results indicate that the bounds of validity generalization are very broad indeed—much broader than we have previously concluded on the basis of our validity generalization research.

Table 8
Ninety Percent Confidence Values for Average ACB Test Validities

Test                        Sample 1   Sample 2
Vocabulary                    .37        .35
Arithmetic Reasoning          .41        .40
Spatial Aptitude              .34        .37
Mechanical Aptitude           .38        .36
Clerical Speed                .26        .21
Radiocode Aptitude            .29        .29
Shop Mechanics                .31        .32
Automotive Information        .22        .20
Electronics Information       .32        .27
Radio Information             .17        .15

Note. ACB = Army Classification Battery (Helme et al., Note 2).
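Assuming approximate normality of validities across jobs (the assumption discussed in Footnote 1), the Table 8 entries can be recovered from Table 7 as the mean validity minus 1.28 true SDs, and the per-job sample sizes quoted earlier (1,420 and so on) follow from a standard two-group power calculation on the difference between two correlations. The sketch below (Python) is our reconstruction of both calculations, not code from the original studies.

```python
Z90_LOWER = 1.282      # one-sided 90th percentile of the standard normal
Z_ALPHA_1T = 1.645     # one-tailed alpha = .05
Z_ALPHA_2T = 1.960     # two-tailed alpha = .05
Z_POWER_90 = 1.282     # power = .90

def ninety_percent_value(mean_validity, true_sd):
    """Validity value above which 90% of jobs are expected to lie (Table 8)."""
    return mean_validity - Z90_LOWER * true_sd

def n_per_job(validity_difference, z_alpha, z_power=Z_POWER_90):
    """Approximate n required in each of two jobs to detect a given difference
    between two validity coefficients (Fisher-z style two-group formula,
    treating the difference in r as the difference in z)."""
    return 2 * ((z_alpha + z_power) / validity_difference) ** 2 + 3

# Vocabulary, Sample 1: mean validity .51, true SD .1089 (Table 7) -> about .37.
print(round(ninety_percent_value(0.51, 0.1089), 2))

# Differences of one and two true SDs (.11 and .22):
print(round(n_per_job(0.11, Z_ALPHA_1T)))   # roughly 1,420 per job, one-tailed
print(round(n_per_job(0.11, Z_ALPHA_2T)))   # roughly 1,740 per job, two-tailed
print(round(n_per_job(0.22, Z_ALPHA_1T)))   # roughly 360 per job, one-tailed
```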


General Discussion

The two studies reported here are the only systematic empirical evidence bearing on the hypothesis that task or behavior differences between jobs moderate test validities. Contrary to widespread belief in personnel psychology, these results indicate that task differences between job families within an occupational area have little or no effect on test validities and that differences in test validity among entirely different jobs are small. The empirical foundations of these two studies are quite extraordinary. Study 1 is based on a total sample size of 368,877; Study 2 is based on 21,022 individuals. The total sample base is almost 400,000. Thus, these studies avoid the potential deceptiveness of small sample results (Schmidt & Hunter, 1978; Schmidt, Hunter, & Urry, 1976; Tversky & Kahneman, 1971). These sample sizes are critical in evaluating the results reported here. If the typical study has a sample size of 100, then the two studies reported here are equivalent to 3,899 such studies.

Implications for Validity Generalization

It is important to note that the findings reported here greatly strengthen our earlier conclusion that the hypothesis of situational specificity of test validity is false. Our studies of validity generalization (Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt & Hunter, 1977; Schmidt, Hunter, Pearlman, & Shane, 1979) have shown that most of the variance in observed validity coefficients within single job types or job families can be accounted for by four statistical artifacts for which it is possible to make corrections. Our conclusion has been that other artifacts probably account for the small amounts of remaining variance. The finding that gross differences between jobs in task makeup produce relatively small variations in test validities means that it is highly unlikely that smaller variations in the content of the same job from setting to setting could produce larger variations in test validity. In fact, the results of the present studies indicate that the conclusions produced by our validity generalization studies are much too conservative. Those studies


underestimate substantially the generalizability of employment test validities.

Implications for Job Analysis in Selection Research

The findings of the two studies reported here have implications for the types of job analysis that should be used in selection research involving traditional aptitude tests. In such contexts, fine-grained, detailed job analyses tend to create the appearance of large differences between jobs that are not of practical significance in selection. Our results indicate that such molecular job analyses, so heavily emphasized in personnel selection research in recent years, are unnecessary in practice. Instead, much broader methods, such as those that permit the grouping of jobs on the basis of their broad content structure or their similarity in inferred ability requirements—without reference to specific tasks, duties, or behaviors—are the most appropriate and powerful techniques. Such methods are discussed in Pearlman (1980). Our findings make clear that such broader methods should also be used in demonstrating job similarity for validity generalization purposes.

Implications for Criterion Construction

Our results have important implications for the movement toward criterion fractionation. If large between-job differences in tasks do not moderate test validities (or do so to only a small extent), then task dimensions within the same job (e.g., of the kind produced by behaviorally anchored rating scales) are unlikely to do so. This conclusion is consistent with findings that correlations between criterion dimensions, after correction for attenuation due to unreliability, typically approach 1.00, indicating that different behavioral dimensions are virtually collinear at the true score level (Schmidt & Hunter, Note 3). Under these circumstances, it is obvious that only a measure of overall job performance is needed in validity studies. These considerations point to the conclusion that the only function of multiple criterion scales is to increase the reliability of the composite (overall) criterion measure.


That is, replication of judgments on essentially the same dimension leads to increased reliability in the same way that use of multiple judges or longer tests does (Schmidt & Hunter, Note 3).

Implications for Other Hypothesized Moderators

Of the various potential moderators of test validity that have been suggested by personnel psychologists, the fact of task differences between jobs has been regarded as most likely to produce large moderator effects. In fact, belief in the importance of task differences as a moderator has been so firm that apparently little need has been felt for empirical tests of this assumption. This has not been true in the case of other postulated moderators. The fact that large task differences between jobs do not moderate test validities (or do so only to a small extent) suggests that other hypothesized moderators are unlikely to be important. Examples include organizational climate; management philosophy or leadership style; geographical location; changes in technology, product, or job tasks over time; age; socioeconomic status; and applicant pool composition. (Race and ethnic group have previously been shown not to moderate test validities. See Hunter, Schmidt, & Hunter, 1979; Schmidt, Pearlman, & Hunter, 1980.) Our previous researches on validity generalization (Pearlman et al., 1980; Schmidt, Gast-Rosenberg, & Hunter, 1980; Schmidt, Hunter, Pearlman, & Shane, 1979) provide strong evidence that these and other moderators do not in fact exert important effects. All of these hypothesized moderators had the opportunity to exert their effects in the data on which these studies were based. Yet the typical finding was that after removal of variance due to statistical artifacts, the amount of variance remaining in validity coefficients from one study to another for the various test-job combinations was too small to leave room for the operation of nontrivial moderator effects. This approach to determining the probable magnitude of moderator effects is explored further in Schmidt and Hunter (1978). Thus, our previous research on the generalizability of validities

supports the prediction from the present study that commonly hypothesized moderators are unlikely to be important.

Implications for the Uniform Guidelines

The results reported here have important implications for the proper interpretation of the provision for validity generalization in the Uniform Guidelines on Employee Selection Procedures (USEEOC, 1978). These guidelines provide for validity generalization (called transportability) on the basis of a demonstration that new jobs to which one wishes to generalize previous validation results consist of "substantially the same major work behaviors" (Section 7b [2], p. 38299) as those jobs in the original studies. The findings in Study 1 show that the proper interpretation of this provision is one based on the general activity or content structure characterizing a broad occupational area such as clerical occupations. These results show that narrower interpretations in terms of tasks, duties, or behaviors are both empirically indefensible and unnecessarily restrictive in the clerical area. This situation is probably similar in other occupational areas.² In addition, the results of Study 2 show that the requirement of "substantially the same major work behaviors," even where broadly interpreted, may be unjustifiably stringent. Study 2 results show that validities of conventional aptitude tests differ little across jobs varying widely in broad duties, activities, and responsibilities. Further, as shown in Table 8, the probability is very low that such variations in validity as do exist will result in a situation in which the test would be considered valid for some jobs but invalid for others.

Relation to Fleishman's Research

It is instructive to relate our findings to the large body of research generated by

² Fortunately, it appears that courts are already beginning to endorse a broader interpretation of what constitutes sufficient job similarity for validity generalization; see Friend v. Leidinger, 18 FEP 1055 (Fourth Circuit, November 29, 1978); Pegues et al. v. Mississippi State Employment Service et al. (Northern District of Mississippi, March 7, 1980).


Fleishman and his associates over the last 25 years (cf. Fleishman, 1975, for a recent summary of this work). For psychomotor tasks of various kinds, Fleishman and his associates report that the relative contribution of different abilities (both psychomotor and cognitive) to performance changes as practice proceeds and that these changes are progressive and systematic and eventually become stabilized (Fleishman & Hempel, 1954, 1955). Our own examination of Fleishman's work (including a refactoring of data from three of his studies) has convinced us that these changes are real and are not the product of statistical artifacts and sampling error.

Is there a contradiction between our findings and those of the Fleishman research program? We think not, for two reasons. First, the Fleishman studies have each focused on a single well-defined but narrow task. For example, in the Complex Coordination task, "The subject is required to make complex motor adjustments of an airplane-type stick and rudder in response to successively presented patterns of visual signals" (Fleishman & Hempel, 1954, p. 240). In employment settings, jobs are typically not composed of a single task. In fact, if tasks are defined at this level of specificity, jobs are made up of a large number of widely varying tasks. Thus, the contribution of a given ability to total job performance will be an average of its contribution to each task. These average contributions will vary much less from job to job, as we find in our data. Second, the Fleishman studies focus on changes that take place during the early stages of practice, whereas the validities examined in our research pertain to performance as it has stabilized over a much longer period of time. Fleishman and Hempel (1954) assessed performance on the Complex Coordination task in 64 2-minute trials over a 2-day period. Total practice time was thus 2.13 hours. In another study, Fleishman and Hempel (1955) assessed performance on 16 trials on the Discrimination Reaction Time task over a period of about 45 minutes per subject. In this same study, the authors state that since performance and performance-ability correlations stabilize with


practice, "the problem is more one of identifying the particular abilities important at later stages of practice" (p. 310). Since the performance of participants in validity studies is evaluated only after they have been on the job (or in training) for some weeks, months, or years, the validity coefficients on which our findings are based represent abilities important at later stages of practice. In this connection, it is interesting to note that Adams (Note 4) found that the correlation between performances on different complex psychomotor tasks increases with practice and training. That is, the more practice and training the subjects were given on a variety of tasks, the more similar their rank order in the group became on the different tasks. Adams and Fleishman and Hempel (1955) suggest that this finding reflects the existence of a general "integration" ability that does not manifest itself until the individual components of each task have been learned. This ability leads to the organization of task components into their proper sequence and pattern and may be independent of facility in learning the individual components. If a similar process operates with respect to job performance, it would mean that over time, between-job correlations in performance would become progressively higher. That is, although the person who is initially best on job A may not have a high probability of being initially best on job B, over longer periods of time, individuals would develop highly similar rank orderings of performance in different jobs. At this time, validity differences between jobs for a given test would be small, precisely the result observed in the two studies reported here. What Does Moderate Validities? If tasks and behaviors do not moderate test validities, what does? The best candidate is the information-processing and problem-solving demands of the job. Jobs differing in superficial behaviors and tasks may often impose similar information-processing demands. In this connection, Ghiselli (1966, p. 106) reports suggestive results. He found that clustering jobs based on similarity in average validity patterns of intellectual-per-


perceptual, spatial-mechanical, and motor ability tests led to groupings entirely different from those produced when the same jobs were clustered on the basis of similarity of tasks, duties, and responsibilities. Similarly, Maier and Fuchs (1969, 1972) have shown that it is possible to construct groupings of widely differing (i.e., task-heterogeneous) jobs that nevertheless are characterized by the same pattern of test validities for predicting training school success. Pearlman (1980) has reviewed these and a number of similar cases in which task-heterogeneous jobs clustered together when analyzed in terms of their ability or information-processing requirements. However, even where information-processing demands differ, the moderator effect may be small because abilities tend to be highly correlated and compensatory. Large moderator effects (i.e., large differences between jobs) may occur at the level of (population) regression weights. That is, the small differences between jobs in test validity—which Study 2 shows do exist—may correspond to substantial differences in regression weights. If this is the case, the actual causal role of a given ability in job performance may differ substantially between jobs even though test validities differ to only a small extent. Studies examining this question would be of great theoretical significance. Practical significance will be constrained by the fact that alternate sets of positive weights applied to cognitive tests typically produce similar multivariate validities (Schmidt & Hunter, Note 3). However, even small increases in validity can have substantial practical value (Hunter & Schmidt, in press; Schmidt, Hunter, McKenzie, & Muldrow, 1979). In the area of personnel classification, between-job differences in regression weights can contribute to practical utility by reducing between-job correlations in predicted future job success (Hunter & Schmidt, in press). The practical problems in research of this kind are enormous; regression and beta weights have greater sampling error than validity coefficients. However, we have made some beginning attempts to tackle this problem (Schmidt, Hunter, & Caplan, Note 5).
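To illustrate how substantial differences in regression weights can coexist with small differences in validities, consider the toy calculation below (Python with NumPy). The ability intercorrelation and the two sets of population weights are invented for illustration: the two hypothetical jobs weight the two abilities in completely opposite ways, yet because the abilities correlate .70, their zero-order validities differ by only .15.

```python
import numpy as np

# Assumed correlation between two cognitive abilities (illustrative).
r12 = 0.70
R = np.array([[1.0, r12],
              [r12, 1.0]])

# Hypothetical population standardized regression weights for two jobs.
beta_job_a = np.array([0.50, 0.00])   # performance driven mostly by ability 1
beta_job_b = np.array([0.00, 0.50])   # performance driven mostly by ability 2

# For standardized variables, the implied zero-order validities are R @ beta.
validities_a = R @ beta_job_a          # [0.50, 0.35]
validities_b = R @ beta_job_b          # [0.35, 0.50]
print(validities_a, validities_b)
```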

Historical Antecedents

If the conclusions indicated by Studies 1 and 2 are correct, it is appropriate to ask how the field of personnel psychology came to be so far off base. Answering this question requires a brief excursion into, and reinterpretation of, the history of the field. Essentially, our position is that the field was led to adopt many false conclusions as a consequence of a good intention: the intention to be rigorously scientific. From its inception, industrial psychology was to be a field of knowledge based on objective and verifiable empirical research findings. Although the settings in which industrial psychologists practiced their profession—mostly private companies—offered research opportunities, available sample sizes were almost invariably small. The choice was between attempting to build the field on an empirical foundation of small-sample studies or forgoing the possibility of an empirical foundation. In light of the aspirations of the field for scientific status, the latter alternative was unthinkable, and thus industrial psychologists came to accept small-sample studies. Because such studies were the only kind feasible, not much attention was paid to the potential deceptiveness of their results. The small-sample study was first regarded as better than no study at all and then gradually came to be regarded as almost as good as a large-sample study (Schmidt & Hunter, 1978). That is, personnel psychologists came to be believers in the "law of small numbers" (Schmidt & Hunter, 1978; Schmidt et al., 1976; Tversky & Kahneman, 1971). But the laws of statistics, like the laws of physics, operate inexorably whether there is human awareness of them or not. Thus, the sampling error unavoidable in small studies guaranteed the accumulation of divergent and even contradictory results across studies (Hunter, Note 6). Dunnette (1963), in describing this state of affairs, said:

It is noteworthy that the studies reviewed by Ghiselli show no typical level of prediction for any given test or type of job. In fact, there seems to be little consistency among various studies using similar tests and purporting to predict similar criteria. (p. 318)

Why were personnel psychologists unable to see that much or all of this variation might


be due to statistical or measurement artifacts? The answer, we believe, lies partly in insufficient understanding of statistical and measurement theory. For example, many in the field did not fully understand the nature of or need for corrections for criterion unreliability and range restriction. And, of course, few appreciated the magnitude of sampling error in small samples, as indicated above. But there was another important reason—the influence of logical positivism, then the dominant philosophy of science in psychology. Space does not permit an exploration of the properties of logical positivism or a discussion of its retarding effect on the progress of psychology since the 1920s; for such discussions, we refer the reader to Mackenzie (1972, 1977), Schlagel (Note 7), and Toulmin (Note 8). Suffice it to say here that logical positivism produced an extreme and now discredited distinction between "observable" and "unobservable" phenomena (Glymour, 1980). Only the "observable" was considered admissible for purposes of building a science. Foremost among those things that were "observable" were empirical data. Data became reified, and, as a result, it became difficult to appreciate that data in raw form could be—and typically were—distorted by statistical and measurement artifacts, causing them to convey a warped picture of the underlying processes that produced them. Since the data were accepted at face value, explanations in terms of underlying processes were sought for the complexity that seemed indicated by the wide variation in study results. These explanations took the form of moderator variables (Dunnette, 1963).

In personnel selection, the most important of these moderators was the undifferentiated situation. Because validity results varied widely for the same predictor type and job from setting to setting, industrial psychologists concluded that subtle differences in jobs or settings were a major determinant of test validities. As Albright, Glennon, and Smith (1963) confidently stated:

If years of personnel research have proven anything, it is that jobs that seem the same from one place to another often differ in subtle but important ways. Not surprisingly, it follows that what constitutes job success is also likely to vary from place to place. (p. 18)

This conclusion was highly disappointing; as Guion (1976) has noted, it meant that not only validity generalization but also theory construction was beyond the realm of possibility.

This is where matters remained until the late 1950s and early 1960s, when behaviorist influences began to make themselves felt in industrial psychology. Behaviorist influences on psychology as a whole have been traced and critiqued by Koch (1964), Mackenzie (1972, 1977), McKeachie (1976), and Schlagel (Note 7). Mackenzie also shows that the fundamental philosophy of science assumptions underlying Watsonian classical behaviorism and Spenceian-Hullian neobehaviorism are derived from logical positivism, which has since been thoroughly discredited as a philosophy of science (cf. Suppe, 1977; Schlagel, Note 7). (Skinnerian behaviorism is a special and more complex case; see Mackenzie, 1977.) Locke has documented and convincingly critiqued behaviorist assumptions and influences in nonselection areas of industrial-organizational psychology (Locke, 1972, 1977, 1979, 1980). For example, he has shown that "behavior modification" as used in some industrial organizations is not "behavioristic" at all but instead consists only of a set of new labels for reward and punishment processes that have been known and used since antiquity.

Until recently, however, little attention has been devoted to tracing and illuminating behaviorist influences in personnel (selection) psychology. This neglect has been due in part to the fact that relative to other areas of psychology, behaviorist influences (as opposed to positivist influences) arrived relatively late, in the late 1950s and early 1960s. McGuire (Note 9) has pointed out behaviorist assumptions and influences in the Standards for Educational and Psychological Tests (APA, 1974), the Principles for the Validation and Use of Employee Selection Procedures (APA, 1975), and the stronger behaviorist influences in the federal government's Uniform Guidelines on Employee Selection Procedures (USEEOC, 1978). Eyde, Primoff, and Hardt (Note 10)


have also pointed out and discussed behaviorist notions and influences as found in the Uniform Guidelines. Finally, Hunter (1980) has traced the historical course of behaviorist influences on personnel psychology. Behaviorist influences appealed to the strong desire in personnel psychology to be rigorously scientific, and therefore they encountered little resistance. Thus, the field was led down the primrose path for the second time by the laudable intention to be scientific. The implicit behaviorist message was that personnel psychology as it was then constituted was not really scientific because it was built on the differential psychology foundation of traits and abilities. Traits and abilities were said to lack scientific status because they were "unobservable," were "constructs," or were "internal processes," or all three. The message was that science could be based only on those things that are "objectively observable," and in psychology, this meant "behavior" (Koch, 1964; Mackenzie, 1972, 1977). Psychology was defined as the science of objectively observable behavior. These blandishments did not lead to the abandonment of the individual differences foundation of personnel selection. Indeed, this foundation cannot be rejected without abandoning the field as a whole. However, behaviorist influences did produce changes in the field. Traits and abilities lost some of their former status, and psychologists shifted their attention to job behaviors and tasks (e.g., Dunnette, 1963³; Guetzkow & Forehand, 1961). Within the context of behaviorist assumptions, this shift in emphasis allowed selection psychologists to perceive their research as more scientific: It was now focused directly on behaviors. In addition, it provided the possibility of a solution to the problem of situational specificity of employment test validities. Behaviorist notions suggested that the reason validities were different from setting to setting was that the actual job behaviors were different. A fine-grained analysis of job behaviors could be used to identify these differences, and the resultant information could be used to solve the problem of validity generalization (Dunnette, 1976; Pearlman, 1980). Validities

would be found to be tied to, and dependent on, specific observable behaviors. Thus was born the dual emphasis of the late 1960s and early 1970s on behavioral fractionation of criterion measures (e.g., the much-researched behaviorally anchored rating scale for assessing behavioral dimensions of job performance) and task-oriented job analyses. These developments are described in detail by Dunnette (1976). By the early 1970s most of the field had been converted to this new orientation. Today these ideas have been virtually set in concrete in the federal government's Uniform Guidelines on Employee Selection Procedures (USEEOC, 1978; McGuire, Note 9). It is now widely recognized in most areas of psychology that basic behaviorist assumptions are conceptually flawed (McKeachie, 1976) and that behaviorism as a scientific effort has failed in all its major objectives (Koch, 1964; Mackenzie, 1977). The research reported here is the first to test behaviorist assumptions as manifested in personnel psychology. The findings reported here show behaviorist assumptions to be false. These studies show that the key predictions based on behavioristic assumptions are disconfirmed. Test validities are not moderated by fine-grained task or behavior differences, and they are moderated to only a very small extent by large task differences. It is our judgment that the natural alliance of personnel psychology is with the new and expanding field of cognitive psychology and not with classical behaviorism, neobehaviorism, or behavioristically oriented areas of psychology. Founded on an explicit rejection of behaviorist assumptions, cognitive psychology focuses on human information-processing and problem-solving skills. Such skills have repeatedly been found to be substantially correlated with job performance, in lower- as well as higher-level jobs. The behaviorist thrust denigrates the importance of such skills and, in some versions, comes close to denying their existence (Mackenzie, 1977).

³ Dunnette has since substantially altered his position on this question (cf. Dunnette & Borman, 1979).


Behaviorist Fallacies

Thus far, we have sketched a brief outline of the events that led to the behaviorist penetration of personnel psychology. Massive empirical evidence shows that behaviorist hypotheses in personnel psychology are false. But because of their surface plausibility and because of their longstanding acceptance, behaviorist notions are not likely to be rejected without analysis. An understanding of the flaws inherent in behaviorist philosophical principles is necessary for a full appreciation of the import of the empirical findings reported here. There are two central claims to the modern behavioristic beliefs as they are manifested in the field of personnel psychology: that abilities are not observable and that behavior is observable. Both claims are false to fact. Consider the supposed observability of behavior. Suppose that a worker is to screw a certain bolt into a certain hole in each automobile as it passes on the assembly line. Is "screwing in the bolt" an observed behavior in the worker? Certainly not. During the 19th century, physiologists and physiological psychologists established the fact that no event can have an effect on the brain unless it is mediated by afferent nerve fibers connecting some sensory mechanism to the brain. Thus, the bolt hole in the automobile cannot be the causal determinant of behavior per se; the causal determinant must be a pattern of afferent nerve impulses in the brain (i.e., an unobserved event). Moreover, a detailed analysis of the neural pathways involved in vision alone shows that no pattern of neural impulses is ever repeated. There is such fine detail of neural pathways that the brain records even the smallest difference in the angle of the head, the viewing distance, the position of the car on the assembly line, the difference in the postural muscles stemming from differences in foot position or tool position, and so forth. Thus, strictly speaking, the stimulus of "bolt hole" is an event in the observing psychologist's head, not in the eye of the worker. If there is to be an equivalence of successive bolt holes for the worker (and we have no

181

doubt that there is), then that equivalence is a perceptual construction of the meaning of similar events computed in the brain. Thus the stimulus of behaviorism is a hypothetical construct whose existence is a matter of inference from observed data based on an implicit (in this case) theory. Thus, stimuli are no more observable than traits. Furthermore, physiology has established similar facts about the response or behavior. A motor act such as screwing in a bolt is known to be a highly complex pattern of time-sequenced patterns of neural impulses to thousands of muscle fibers, to postural muscles, to eye muscles, and so forth. Thus, it is well known that no response ever repeats itself either. Thus, any equivalence of successive acts must be an internal perceptual process carried out by the brain. Therefore, responses too are events in the observing psychologist's mind and hence not directly observable. The belief that workers react to successive acts as repetitions shows that behavior is a hypothetical construct that is stable over time even though specific neural impulses are not. Stable behaviors are inferred from current theory and verified using observations interpreted within that theory. Thus the response (i.e., the behavior) of behaviorism is no more observable than is an ability; both are hypothetical constructs in the minds of those who use them as theoretical devices. Now let us consider constructs as employed in differential psychology. Is quantitative ability, for example, an unobserved variable whose existence depends on some nebulous theory of unobserved brain function? Certainly not. So far as we are aware, none of the tests currently used in employment selection was derived from a psychological theory of numerical or quantitative ability (such as Piaget's theory of operations, which is under study in educational psychology). Rather, the basis for existing quantitative ability tests rests on what is currently called "content validity" (see Ebel, 1977, for an extensive discussion of this fact for all ability tests); that is, quantitative ability is defined in terms of the ability to correctly carry out a certain family of tasks.


Although this family of tasks has never been written out in a full formal manner (to our knowledge), it is sufficiently well specified that many different investigators in different places, using different populations, with different applications in mind, and at different times, have constructed instruments that have subsequently proven to be statistically equivalent to one another (i.e., instruments that define the same general factor to within error of measurement). In hundreds of cases, jobs such as accountant or computer programmer have been predicted to depend on quantitative ability because quantitative processes are obviously part of the job, and in hundreds of cases, these predictions have been borne out. So quantitative ability is an empirically defined measurement procedure like any other used in science. Though no measuring instrument in science is ever perfect (indeed, our measurement theory says that only an infinitely long test can be perfectly reliable), the values are observable in the usual language of science.

Thus, both the term behavior and the term ability represent hypothetical constructs whose nature is defined by a psychological theory. Observed events and values can be tied to these terms only to the extent that the corresponding theoretical sentences have been shown to be valid. On the other hand, the theory that certain acts are repetitions of one another and the theory that certain individual differences stem from psychological skills or abilities are theories that predate psychology and that have massive empirical support. Thus, in ordinary language, both can be regarded as directly observable.
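As an aside on the phrase "statistically equivalent to one another," the usual check is Spearman's correction for attenuation: the observed correlation between two independently constructed tests is divided by the square root of the product of their reliabilities, and a corrected value near 1.0 indicates that the two instruments measure the same factor apart from measurement error. The short sketch below uses hypothetical values of our own choosing and is offered only as an illustration of that check; it is not drawn from the data of the present studies.

    # Correction for attenuation (Spearman); all values below are hypothetical.
    from math import sqrt

    r_xy = 0.76   # observed correlation between two quantitative tests (hypothetical)
    r_xx = 0.80   # reliability of test X (hypothetical)
    r_yy = 0.75   # reliability of test Y (hypothetical)

    r_true = r_xy / sqrt(r_xx * r_yy)   # correlation corrected for unreliability
    print(round(r_true, 2))             # 0.98: essentially the same factor, apart from error

The parenthetical point that only an infinitely long test can be perfectly reliable is the limiting case of the Spearman-Brown formula, r_k = k*r / [1 + (k - 1)*r], which approaches 1.0 as the lengthening factor k grows without bound.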

Summary

In summary, we conclude that several influences combined over time in personnel psychology to produce the false conclusion that task differences between jobs exert significant moderating effects on aptitude test validities; that is, the conclusion that aptitude tests are valid for some jobs but not for others. These influences include the desire for an empirical foundation for the field, logical positivism, belief in the law of small numbers, and behaviorist influences. Logical positivism, the law of small numbers, and classical behaviorism and neobehaviorism have subsequently been discredited. The results reported here, based on a total sample size of almost 400,000, show that the conclusion induced by these influences is false: The moderating effect of tasks is negligible even when jobs differ grossly in task makeup and is probably nonexistent when task differences are less extreme. This fact has important implications for validity generalization, for the use of task-oriented job analysis in selection research, for criterion construction, for moderator research, and for the proper interpretation of the Uniform Guidelines on Employee Selection Procedures (USEEOC, 1978).

The findings and conclusions reported here require changes in beliefs and assumptions of long standing in personnel psychology. Not surprisingly, then, this study has been carefully scrutinized by numerous reviewers and other colleagues. In the Appendix we present some of the questions that have been raised and our responses to them.

Reference Notes

1. Dunnette, M. D. Validity study results for jobs relevant to the petroleum refining industry. Washington, D.C.: American Petroleum Institute, 1972.
2. Helme, W. E., Gibson, W. A., & Brogden, H. E. An empirical test of shrinkage problems in personnel classification research (PRB Tech. Research Note 84). Arlington, Va.: Adjutant General's Office, Personnel Research Branch, October 1957.
3. Schmidt, F. L., & Hunter, J. E. The measurement of job performance in criterion-related validity studies. Unpublished manuscript, U.S. Office of Personnel Management, Personnel Research and Development Center, 1978.
4. Adams, J. A. The prediction of performance at advanced stages of training on a complex psychomotor task (Research Bulletin 53-49). Lackland Air Force Base, Tex.: Air Research and Development Command, Human Resources Research Center, December 1953.
5. Schmidt, F. L., Hunter, J. E., & Caplan, J. R. Validity generalization: Results for two occupations in the petroleum industry. Washington, D.C.: American Petroleum Institute, October 1979.
6. Hunter, J. E. Cumulating results across studies: A critique of factor analysis, MANOVA, and statistical significance testing. Invited address at the meeting of the American Psychological Association, New York, September 1979.

7. Schlagel, R. H. Revaluation in the philosophy of science: Implications for method and theory in psychology. Invited address at the meeting of the American Psychological Association, New York, September 1979.
8. Toulmin, S. E. The cult of empiricism. In P. F. Secord (Chair), Psychology and philosophy. Symposium presented at the meeting of the American Psychological Association, New York, September 1979.
9. McGuire, J. Testing standards and the legacy of behaviorism. Unpublished manuscript, George Washington University, 1979. (Available from J. McGuire, Arlington County Personnel Department, 2100 North 14th Street, Arlington, Virginia 22201.)
10. Eyde, L. D., Primoff, E. S., & Hardt, R. H. What should the content of content validity be? Paper presented at the meeting of the American Psychological Association, New York, September 1979.

References

Albright, L. E., Glennon, J. R., & Smith, W. J. The use of psychological tests in industry. Cleveland: Howard Allen, 1963.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. Standards for educational and psychological tests. Washington, D.C.: Author, 1974.
American Psychological Association, Division of Industrial-Organizational Psychology. Principles for the validation and use of employee selection procedures. Dayton, Ohio: Author, 1975.
Downie, N. M., & Heath, R. W. Basic statistical methods (2nd ed.). New York: Harper & Row, 1965.
Dunnette, M. D. A modified model for test validation and selection research. Journal of Applied Psychology, 1963, 47, 317-323.
Dunnette, M. D. Aptitudes, abilities, and skills. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Dunnette, M. D., & Borman, W. C. Personnel selection and classification systems. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual review of psychology (Vol. 30). Palo Alto, Calif.: Annual Reviews, 1979.
Ebel, R. L. Comments on some problems of employment testing. Personnel Psychology, 1977, 30, 55-63.
Fleishman, E. A. Toward a taxonomy of human performance. American Psychologist, 1975, 30, 1127-1149.
Fleishman, E. A., & Hempel, W. E., Jr. Changes in factor structure of a complex psychomotor test as a function of practice. Psychometrika, 1954, 19, 239-252.
Fleishman, E. A., & Hempel, W. E., Jr. The relation between abilities and improvement with practice in a visual discrimination reaction task. Journal of Experimental Psychology, 1955, 49, 301-312.
Ghiselli, E. E. The validity of occupational aptitude tests. New York: Wiley, 1966.
Glymour, C. The good theories do. In Construct validity in psychological measurement. Princeton, N.J.: Educational Testing Service, 1980.
Guetzkow, H., & Forehand, G. A. A research strategy for partial knowledge useful in the selection of executives. In R. Taguiri (Ed.), Research in executive selection. Boston: Harvard School of Business Administration, 1961.
Guion, R. M. Recruiting, selection, and job placement. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Hunter, J. E. Validity generalization and construct validity. In Construct validity in psychological measurement. Princeton, N.J.: Educational Testing Service, 1980.
Hunter, J. E., Schmidt, F. L., & Hunter, R. Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 1979, 86, 721-735.
Hunter, J. E., & Schmidt, F. L. Fitting people to jobs: The impact of personnel selection on national productivity. In E. A. Fleishman (Ed.), Human performance and productivity. Washington, D.C.: National Science Foundation, in press.
Kendall, M. G., & Stuart, A. The advanced theory of statistics (Vol. 1, 3rd ed.). New York: Hafner, 1973.
Koch, S. Psychology and emerging conceptions of knowledge as unitary. In T. W. Wann (Ed.), Behaviorism and phenomenology. Chicago: University of Chicago Press, 1964.
Locke, E. A. Critical analysis of the concept of causality in behavioristic psychology. Psychological Reports, 1972, 31, 175-197.
Locke, E. A. The myths of behavior mod in organizations. Academy of Management Review, 1977, 2, 543-553.
Locke, E. A. Myths in "The myths of the myths about behavior mod in organizations." Academy of Management Review, 1979, 4, 131-136.
Locke, E. A. Latham versus Komaki: A tale of two paradigms. Journal of Applied Psychology, 1980, 65, 16-23.
Mackenzie, B. D. Behaviorism and positivism. Journal of the History of the Behavioral Sciences, 1972, 8, 222-231.
Mackenzie, B. D. Behaviorism and the limits of scientific method. Atlantic Highlands, N.J.: Humanities Press, 1977.
Maier, M. H., & Fuchs, E. F. Development of improved aptitude area composites for enlisted classification (Tech. Research Rep. 1159). Arlington, Va.: U.S. Army Behavioral Science Research Laboratory, September 1969. (NTIS No. AD-701 134)
Maier, M. H., & Fuchs, E. F. Development and evaluation of a new ACB and aptitude area system (Tech. Research Note 239). Arlington, Va.: U.S. Army Behavior and Systems Research Laboratory, September 1972. (NTIS No. AD-751 761)
McKeachie, W. J. Psychology in America's bicentennial year. American Psychologist, 1976, 31, 819-833.
Pearlman, K. The validity of tests used to select clerical personnel: A comprehensive summary and evaluation (Tech. Study TS-79-1). U.S. Office of Personnel Management, Personnel Research and Development Center, August 1979. (NTIS No. PB 80-102650)
Pearlman, K. Job families: A review and discussion of their implications for personnel selection. Psychological Bulletin, 1980, 87, 1-28.
Pearlman, K., Schmidt, F. L., & Hunter, J. E. Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 1980, 65, 373-406.
Prien, E. P., & Ronan, W. W. Job analysis: A review of research findings. Personnel Psychology, 1971, 24, 371-396.
Schmidt, F. L., Gast-Rosenberg, I., & Hunter, J. E. Validity generalization results for computer programmers. Journal of Applied Psychology, 1980, 65, 643-661.
Schmidt, F. L., & Hunter, J. E. Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 1977, 62, 529-540.
Schmidt, F. L., & Hunter, J. E. Moderator research and the law of small numbers. Personnel Psychology, 1978, 31, 215-231.
Schmidt, F. L., Hunter, J. E., McKenzie, R., & Muldrow, T. The impact of valid selection procedures on workforce productivity. Journal of Applied Psychology, 1979, 64, 609-626.
Schmidt, F. L., Hunter, J. E., Pearlman, K., & Shane, G. S. Further tests of the Schmidt-Hunter Bayesian validity generalization procedure. Personnel Psychology, 1979, 32, 257-281.
Schmidt, F. L., Hunter, J. E., & Urry, V. W. Statistical power in criterion-related validity studies. Journal of Applied Psychology, 1976, 61, 473-485.
Schmidt, F. L., Pearlman, K., & Hunter, J. E. The validity and fairness of employment and educational tests for Hispanic Americans: A review and analysis. Personnel Psychology, 1980, 33, 705-724.
Suppe, F. (Ed.). The structure of scientific theories. Urbana: University of Illinois Press, 1977.
Tversky, A., & Kahneman, D. Belief in the law of small numbers. Psychological Bulletin, 1971, 76, 105-110.
U.S. Department of Labor. Dictionary of occupational titles (4th ed.). Washington, D.C.: U.S. Government Printing Office, 1977.
U.S. Equal Employment Opportunity Commission, U.S. Civil Service Commission, U.S. Department of Labor, & U.S. Department of Justice. Uniform guidelines on employee selection procedures. Federal Register, 1978, 43(166), 38295-38309.

Appendix

Questions/Comments and Answers

Question. The five clerical job families in Study 1 are described as being task homogeneous. Consider the alternative assumption that the job families are actually task heterogeneous. The greater the degree of task heterogeneity, the greater will be both similarities between families and differences within families. At the limit, job families would be made up of randomly selected jobs and within-family variance would be equal to between-family variance and total variance. Does the reader and potential user have any information to convince him or her that the job families are indeed task homogeneous?

Answer. We believe it is self-evident from an examination of job families in Table 1 that task differences between families are greater than those within families. We would also reiterate that great care was taken to ensure that jobs were accurately assigned to families. However, two other facts constitute the most convincing response to this question. First, our earlier validity generalization studies (see text) have shown that most of the within-family variance in observed validities is due to statistical artifacts. If the job families were task-heterogeneous and if task differences moderated validities (the two hypotheses inherent in this question), then this finding would not have been obtained. Alternatively, if the job families are in fact task-heterogeneous, and it is nevertheless true that within-family variance in observed validities is mostly due to artifacts, then the conclusion must still be that task differences have little moderating effect on validities. Thus, there seems to be no way that the findings of Study 1, taken together with the results of our earlier validity generalization studies, can be interpreted as consistent with the moderator hypothesis. Second, the question ignores the findings of Study 2. There can be no question that the jobs in Study 2 (see Table 6) are quite task heterogeneous. Yet these large task differences produced only very small moderating effects.
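For readers unfamiliar with the artifact argument invoked in the preceding answer, the bare-bones form of the validity generalization procedure compares the observed variance of validity coefficients across studies with the variance expected from sampling error alone; whatever remains is the most that real moderators, task based or otherwise, could account for. The sketch below uses hypothetical coefficients and sample sizes of our own choosing and omits the corrections for range restriction and criterion unreliability used in the published procedure.

    # Bare-bones check: how much of the observed variance in validities is sampling error?
    # Hypothetical validity coefficients and sample sizes, for illustration only.
    r = [0.18, 0.25, 0.31, 0.22, 0.35, 0.27]
    n = [68, 112, 54, 90, 75, 130]

    k = len(r)
    r_bar = sum(ri * ni for ri, ni in zip(r, n)) / sum(n)                    # weighted mean validity
    var_obs = sum(ni * (ri - r_bar) ** 2 for ri, ni in zip(r, n)) / sum(n)   # observed variance
    var_err = sum((1 - r_bar ** 2) ** 2 / (ni - 1) for ni in n) / k          # expected sampling-error variance
    var_res = max(var_obs - var_err, 0.0)                                    # residual left for real moderators

    # With these hypothetical values the residual is zero: the observed spread is
    # no larger than sampling error alone would produce.
    print(round(var_obs, 4), round(var_err, 4), round(var_res, 4))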

Comment. The variance in validities in Study 2 induced by task differences between jobs does not appear to be as minute as the authors claim. The coefficients, after correction for range restriction and sampling error, have an SD across jobs of .11. Therefore, the range of validities for a single test is approximately .20 to .65, quite a substantial range.

Answer. The fact that the validity for a given test may have a range across jobs of 45 correlation points is not inconsistent with our conclusions. We do not conclude that our results show that the validity of a given test will be numerically equal for all jobs. On the first page of this article, we specifically define a moderating effect as an effect such that the test has substantial validity for some jobs but is invalid (has a near-zero validity) for others. The data in Study 2 show that (even without corrections for criterion unreliability) this condition occurs with remarkable infrequency. A difference corresponding to the range cited here, that is, a difference of four SDs, will occur less than 4 times out of 1,000. The odds against a difference of this magnitude are 270 to 1 (p = .0037). Nevertheless, even when such a difference occurs, it will rarely be the case that the test will have zero or near-zero validity for one of the jobs in the pair. In other words, even in the highly improbable event that such a difference occurs, only very rarely will our conditions for a moderating effect be met. Thus, it is obvious that someone who assumes that the test is valid (i.e., has a useful level of positive validity for all jobs) will be wrong only with extreme infrequency. In other words, our results indicate that as long as one is dealing with tests measuring traditional cognitive skills or constructs (e.g., verbal ability, arithmetic reasoning, spatial ability, etc.), the conclusion that each such test is valid for all the jobs under consideration will virtually never be in error.
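The probability cited in this answer can be reproduced to a close approximation by assuming that the corrected validities are roughly normally distributed across jobs; the normality assumption and the arithmetic below are ours, added only to make the calculation explicit.

    # Chance of a .45 spread between two randomly sampled jobs, given SD = .11 across jobs.
    # Assumes corrected validities are approximately normal; illustration only.
    from math import sqrt, erf

    sd_jobs = 0.11                 # SD of corrected validities across jobs (from the comment above)
    diff = 0.45                    # spread corresponding to the cited .20-to-.65 range
    sd_diff = sd_jobs * sqrt(2)    # SD of the difference between two independently sampled jobs

    z = diff / sd_diff             # about 2.9
    p = 1 - erf(z / sqrt(2))       # two-tailed P(|difference| >= .45), about .004
    print(round(z, 2), round(p, 4))

On this assumption, the cited .45 spread is about a 2.9-SD event in the distribution of differences between two jobs, which corresponds to odds on the order of 270 to 1 against.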

Question. Just as in any other study, the validity of the conclusions rests on the properties of the criterion. Reference is made to proficiency criteria and training performance criteria. What is the nature of the basic data of these criteria? Ratings? Unstructured observations? Structured observations? Objective performance? To the extent that the criteria are ratings, we may simply be observing relations between ability/aptitude tests on the one hand and general interpersonal likeability or adaptability of the ratee on the other. Thus, one would expect homogeneity of validity coefficients due to no differentiation in what is being predicted.

Answer. This question suggests that supervisory judgments (ratings, rankings, etc.) of employee performance are nothing more than reflections of "interpersonal likeability or adaptability." That is, it assumes that supervisors cannot evaluate employee performance. If this were true, personnel psychology would indeed be in trouble, since most validity studies (and performance appraisal systems) employ supervisory evaluations of performance. However, there is no research evidence that this speculation is correct. In the present research, most measures of job proficiency (Tables 2 and 3) were indeed supervisory evaluations. However, evaluations of performance in training, both in Study 1 (Tables 4 and 5) and in Study 2 (Table 7), were almost invariably based on direct measures of amount learned in training (e.g., written or performance tests of knowledge or skills). The important point is that the results obtained and the conclusions reached are the same for the two kinds of criterion measures. Whether or not criterion measures are based on supervisory judgments, the results indicate that task differences between jobs exert little or no moderating effect on validities.

Comment. Other people, not behaviorists, would argue that the action in variables that might moderate relations between test behaviors and job behaviors lies in characteristics of environments. To the extent that such characteristics are not assessed in these studies, or to the extent that they vary but little, the moderator hypothesis has not been given a chance to exert itself.

Answer. This comment misses one of the most important points of this article. In the section entitled "Implications for Other Hypothesized Moderators," we point out that all such potential moderators, including "characteristics of environments," have had the opportunity to exert their effects in our earlier validity generalization studies. That is, the data on which those studies were based come from validity studies conducted in a wide variety of organizations and settings. These data can therefore safely be assumed to reflect the range of variation characteristic of such potential moderators in real organizations. Yet the finding was that after variation due to artifacts was partialed out, very little variance remained within which such moderators could operate (see also Schmidt & Hunter, 1978). Now, it is possible (though it remains to be demonstrated) that if one artificially increased the range of potential moderators (e.g., by creating extreme variations in "organizational climate" in a laboratory study), the size of the moderator effect could be increased. But such a finding would not affect the validity of our conclusions about the operation of moderators in real organizations.

Received April 24, 1980