PERSONNEL PSYCHOLOGY
1986, 39
VALIDITY GENERALIZATION RESULTS FOR LAW ENFORCEMENT OCCUPATIONS

HANNAH ROTHSTEIN HIRSH, LOIS C. NORTHROP
U.S. Office of Personnel Management

FRANK L. SCHMIDT
University of Iowa
The Schmidt-Hunter interactive validity generalization procedure was applied to validity data for cognitive abilities tests for law enforcement occupations. Both assumed artifact distributions and distributions of artifacts constructed from information contained in the current sample of studies were used to test the hypothesis of situational specificity and to estimate validity generalizability. Results for studies using a criterion of performance in training programs showed that validities ranged from .41 to .71, and for four test types the hypothesis of situational specificity could be rejected using the 75% decision rule. For the remaining test types, validity was generalizable, based on 90% credibility values ranging from .37 to .71. Results for studies using a criterion of performance on the job indicated that the hypothesis of situational specificity was not tenable for three test types, which had validities between .17 and .31. For the remaining test types, estimated mean true validities ranged from .10 to .26 and were generalizable to a majority of situations. Results for both groups of studies were essentially identical for the two types of artifact distribution. Possible reasons for the apparently lower validities and lesser generalizability for job performance criteria are discussed, including possible low validity of the criterion (due to lack of opportunity by supervisors to observe behavior) and the potential role of noncognitive factors in the determination of law enforcement job success. Suggestions for specifically targeted additional research are made.

The opinions and conclusions expressed in this article are those of the authors and do not necessarily reflect the policy of the institutions with which they are affiliated. Requests for reprints should be sent to Hannah Rothstein Hirsh, 1421 Hudson Road, Teaneck, NJ 07666.

Copyright 1986 Personnel Psychology, Inc.
This study has two major purposes: (1) to estimate the validity of written tests of cognitive ability used to select professional law enforcement personnel and (2) to compare validity generalization findings based on artifact distributions drawn from sample data to findings obtained using "assumed" distributions.
The data base consisted of all available criterion-related validity studies of tests of cognitive abilities conducted on law enforcement personnel. Campbell, in his remarks as outgoing editor of the Journal of Applied Psychology (1982), indicated that one of the large gaps in the published selection research literature was validity studies on police and firefighters. This study cumulates the research evidence across all available studies, and the results can be used to fill that gap for police occupations.

In applying validity generalization methods in most previous studies, it was necessary to use assumed distributions of artifacts to estimate artifactually induced variance in validity coefficients, because information on criterion reliability, test reliability, and range restriction is not presented in most validation studies. In the present study, information on these artifacts was available from some of the studies, enabling us to analyze the data in two ways: (1) using the assumed distributions constructed by Schmidt and Hunter, and (2) using artifact distributions constructed with information extracted from the validation studies. We were thus able to determine how closely the Schmidt and Hunter (1977) distributions approximated those developed from our data set and whether the results or conclusions drawn from the two analyses were similar or different.
Method

Identification of Occupations Included in the Study

The occupations studied are those in category 37, Protective Service Occupations, of the Dictionary of Occupational Titles (U.S. Department of Labor, 1977), excluding firefighters and armed forces enlisted personnel. The latter two categories were excluded because there are large differences between their job content and that of law enforcement jobs. The remaining five 3-digit occupational groups in the protective service category were classified as law enforcement occupations. These were: group 372, security guards and corrections officers; group 375, police officers and detectives in public service; group 376, police officers and detectives except in public service; group 377, sheriffs and bailiffs; and group 379, protective service occupations not elsewhere classified. Over 80% of the validity coefficients were from studies of jobs in the police officers and detectives in public service category (group 375), while the remaining 20% were distributed over the other four groups. Based on a decision rule explained below, this meant that at the individual occupational group level separate meta-analyses could be conducted only for group 375.
TABLE 1
Summary Descriptive Statistics of Law Enforcement Studies

                                                          Criterion type
Statistic                                          Training success   Job proficiency
Number of coefficients                                   138               242
Percentage of coefficients from published studies        55                43
Percentage of studies from "Police and Detectives
  in Public Service" job group (DOT code 375)            91                74
Percentage of coefficients from predictive
  (vs. concurrent) studies                                71                30
Mean sample size                                         142                92
Mean observed validity (across all test types)           .36               .13
The sample of studies. In addition to reviewing the published literature, we contacted numerous state and local jurisdictions and consultants and two law enforcement-related professional associations in an attempt to locate unpublished criterion-related validity studies of law enforcement personnel. All criterion-related validity studies that (1) included cognitive abilities, (2) used either a job proficiency or training criterion, and (3) reported sample size, test type, criterion type, and validity coefficient were included in the database. There were 40 such studies, representing a somewhat greater number of independent samples, and a total of 381 validity coefficients. Of these 381, 242 used a job proficiency criterion and 138 used a training performance criterion. Table 1 presents various descriptive statistics pertinent to these data. Note that the average sample size of the training studies was 142 and that only 22% were at or below an N of 68, the median size found by Lent, Auerbach, and Levin (1971) in their review of published studies. The average sample size for studies using proficiency criteria was 92, with 40% of the studies at or below an N of 68. Other notable differences for the proficiency studies were the greater proportion of jobs from occupational categories other than police and detectives in public service, the smaller proportion of predictive (versus concurrent) validity studies, and the smaller mean observed r in this group.

Tests were classified using a modified version of Pearlman, Schmidt, and Hunter's (1980) system. Eight general test-type categories were originally established, all but one of which (clerical aptitude) represented a construct or ability factor found in the psychometric literature. These test types were clerical aptitude (any composite of verbal and quantitative abilities and perceptual speed),
memory, psychomotor ability, perceptual speed, quantitative ability, reasoning, spatial/mechanical ability, and verbal ability. Operational definitions of the test-type categories and the item types included in each follow Pearlman (1979) and Pearlman et al. (1980), with the following exceptions. Pearlman et al. placed composite tests that measured verbal, quantitative, and either reasoning or spatial/mechanical ability into a general mental ability category. Because the composite tests in our sample were more varied in content, we classified each composite of two or more cognitive abilities as a separate test type. Additionally, while Pearlman et al. classified tests of arithmetic reasoning under quantitative ability, we followed Ekstrom, French, and Harman's (1976) classification and placed these tests in the reasoning category. Our quantitative ability category contained tests of numerical computation and tests of quantitative ability not further specified. We also created a category for tests that, while primarily cognitive in nature, also included measures of human relations or driving practices. We have no theoretical rationale for this category, but the alternative of discarding these data (which represented 7% of all training coefficients and 10% of the proficiency coefficients) was less desirable. Data from studies in this category are reported separately and are not included in totals or means.

The decision rules used to determine whether or not data were suitable for inclusion in the research, as well as those which applied to recording the validity coefficients from acceptable studies, generally follow those established by Pearlman et al. (1980). The only exception is that in studies reporting validity for different sex or racial subgroups of a sample and for the total sample, we recorded only the total group data, unless artifact information was available for the subgroups only and not for the total group, in which case the subgroup data were recorded.
Data Analysis

Table 2 shows the frequency distributions of validity coefficients by test type and criterion type for jobs in the police and detective in public service category. For purposes of our study, we limited our analyses to validity distributions containing at least six coefficients. As explained in Pearlman et al. (1980), there is no theoretical basis for setting a lower limit on the number of coefficients required for a validity generalization analysis. However, it is desirable to impose a minimum (albeit an arbitrary one) because the expected accuracy of the
TABLE 2
Frequency of Compiled Validity Coefficients by Test Type, Police and Detectives in Public Service (DOT Occupational Group 375)

Test type                                        Training   Proficiency
A. Cognitive:
  Memory (M)                                        6a          25a
  Psychomotor (P)                                   0            9a
  Perceptual speed (PS)                             4            4
  Quantitative (Q)                                  9a           8a
  Reasoning (R)                                    24a          29a
  Spatial/Mechanical (Sp/M)                        13a          29a
  Verbal (V)                                       26a          18a
  Clerical (V+Q+PS)                                 5            3
  Verbal plus reasoning (V+R)                       7a
  V+R+M+Sp/M                                                     7a
  Other cognitive composites (V+PS+R, V+Q+R, V+Q+Sp/M,
    V+Q+M+R, V+Q+Sp/M+PS+R, integrative processes, verbal
    composite not otherwise specified, general cognitive
    composite not specified): fewer than six coefficients each
B. Cognitive plus noncognitive composites (human relations, HR;
   driving, D):
  V+M+R+D+HR                                                    15a
  Other composites (V+R+Q+HR, M+V+Q+R+D+HR): fewer than six
    coefficients each
Totals                                            127          179

a Retained for analysis.
estimated true validity distribution is higher when based on a relatively large number of coefficients. This is particularly true of the SD of the distribution (cf. Schmidt, Hunter, Pearlman, & Hirsh, 1985). In this regard, it is important to note that the distributions analyzed in the present study are all composed of relatively small numbers of coefficients. The largest number of coefficients in any of the distributions studied was 29.
Assumed and Sample-Based Artifact Distributions

The data were analyzed using sample-based artifact distributions and using assumed artifact distributions. Tables 3, 4, and 5 contain
TABLE 3
Assumed Distribution of (Unrestricted) Test or Training Criterion Reliabilities Across Studies
(Constructed by Schmidt & Hunter, 1977. Expected value = .80)

Reliability   Relative frequency
   .90               15
   .85               30
   .80               25
   .75               20
   .70                4
   .60                4
   .50                2

TABLE 4
Assumed Distribution of Range Restriction Effects Across Studies
(Constructed by Schmidt & Hunter, 1977. Expected value = 5.945)

Prior selection ratio   SD of test   Relative frequency
       1.00               10.00             5
        .70                7.01            11
        .60                6.49            16
        .50                6.03            18
        .40                5.59            18
        .30                5.15            16
        .20                4.68            11
        .10                4.11             5
the assumed distributions of artifacts constructed by Schmidt and Hunter. A discussion of the rationale behind these is presented in Schmidt and Hunter (1977) and in Pearlman et al. (1980). Table 6 presents summaries of artifact distributions constructed from the studies analyzed.1

Predictor reliabilities. When not presented in the validity study, reliability coefficients for predictors were often available from test manuals for standard tests. Because we had gathered data on the number of items in a predictor (where reported), as well as on predictor reliability, we were able to conduct a partial check on the representativeness of the distribution of reported reliabilities relative to the predictor reliabilities of the total data set. For each test type-criterion type combination, we compared the average number of items from those studies reporting both test reliability values and number of items to those for which test reliability information was unavailable but the number of items was reported.

1 More detailed information is available in Validity Generalization Results for Law Enforcement Occupations (Hirsh, H. R., Northrop, L. C., & Schmidt, F. L., 1985), an Office of Personnel Management technical report.
TABLE 5
Assumed Distribution of Proficiency Criterion Reliabilities Across Studies
(Constructed by Schmidt & Hunter, 1977. Expected value = .60)

Reliability   Relative frequency
   .90                3
   .85                4
   .80                6
   .75                8
   .70               10
   .65               12
   .60               14
   .55               12
   .50               10
   .45                8
   .40                6
   .35                4
   .30                3
For every test type, the average number of items in those cases where reliability was known was greater than the average number of items for cases not having an associated reliability estimate. A more representative distribution of reliability values was obtained by computing reliability estimates for the cases in which only the number of items was reported. This was done in the following manner: within each test type, we identified those studies where both predictor reliability and number of items were reported. For each such case, we used the Spearman-Brown formula in reverse to compute the reliability of one item from the whole-test reliability; we then summed these and took their average. Then, by applying the Spearman-Brown formula in the ordinary fashion, the average single-item reliability for each test type was used to estimate the reliability of those predictors for which only the number of items was available. (A brief computational illustration of this step appears below.) The final number of validity coefficients with matched predictor reliabilities for test types with six or more coefficients was 63 for studies using a training performance criterion and 75 for studies using a job proficiency criterion. Test reliability ranges, means, and standard deviations by test type are shown in Table 6 for each criterion type. These values are probably overestimates to some degree because those predictors with no associated reliabilities or numbers of items appeared in many cases to be locally developed and short, rather than commercial and of standard length, and in other cases they were clearly short subtests of longer multiaptitude tests.

Criterion reliabilities. The inadequate information on criterion reliabilities in the training criteria data did not permit construction of a sample-based distribution. We therefore used only Schmidt and Hunter's assumed distribution in applying validity generalization procedures to the set of training studies.
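The following is a minimal sketch, in Python, of the item-count-based reliability estimation described above under Predictor reliabilities. The function names and the example reliability/item-count pairs are illustrative assumptions, not values taken from the studies analyzed here.

```python
def single_item_reliability(r_total: float, n_items: int) -> float:
    """Spearman-Brown applied in reverse: the single-item reliability implied
    by a whole-test reliability r_total for a test of n_items items."""
    return r_total / (n_items - r_total * (n_items - 1))


def stepped_up_reliability(r_one: float, n_items: int) -> float:
    """Ordinary Spearman-Brown: reliability of an n_items-item test built
    from items with single-item reliability r_one."""
    return (n_items * r_one) / (1 + (n_items - 1) * r_one)


# Within one test type: studies reporting both a reliability and an item
# count (hypothetical values) yield an average single-item reliability ...
reported = [(0.85, 40), (0.78, 30), (0.81, 35)]
mean_r_one = sum(single_item_reliability(r, k) for r, k in reported) / len(reported)

# ... which is then used to estimate the reliability of a predictor in the
# same test type that reported only its number of items (here, 25 items).
estimated_rxx = stepped_up_reliability(mean_r_one, 25)
```

Because shorter tests step up to lower reliabilities, this procedure assigns the missing values in a way that reflects test length rather than simply imputing the mean of the reported reliabilities.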
TABLE 6
Summaries of Artifact Distributions Constructed from Data in Sample

I. Predictor reliabilities

A. Proficiency criterion

Test type                   No. of reliability   Range of             Mean    Standard
                            coefficients         reliability values   value   deviation
Memory (M)                          5                .61-.81           .73      .067
Psychomotor (P)                     9                .81-.92           .85      .054
Quantitative (Q)                    8                .69-.89           .88      .077
Reasoning (R)                      11                .34-.84           .69      .169
Spatial/Mechanical (Sp/M)          12                .54-.85           .78      .086
Perceptual speed (P/S)a            13                .60-.93           .81      .093
Verbal (V)                          6                .34-.93           .78      .175
V+R+M+Sp/M                          5                .84-.92           .88      .042
V+Q+Sp/M                            -                .80-.91           .83      .047
                                             Mean across test types = .803

B. Training criterion

Test type                   No. of reliability   Range of             Mean    Standard
                            coefficients         reliability values   value   deviation
Memory (M)                          6                .72-.81           .76      .037
Quantitative (Q)                    5                .49-.84           .69      .113
Reasoning (R)                      21                .32-.92           .69      .169
Spatial/Mechanical (Sp/M)           5                .65-.80           .74      .067
Verbal (V)                         20                .34-.96           .75      .152
V+R                                 6                .89-.94           .91      .016
                                             Mean across test types = .757

II. Criterion reliabilitiesb (proficiency criterion)

Test type                   No. of reliability   Range of             Mean    Standard
                            coefficients         reliability values   value   deviation
Memory (M)                         26                .54-.96           .77      .127
Psychomotor (P)                    15                .66-.84           .76      .077
Perceptual speed (P/S)              7                .60-.88           .70      .093
Quantitative (Q)                   12                .60-.88           .69      .096
Reasoning (R)                      32                .60-.96           .73      .120
Spatial/Mechanical (Sp/M)          38                .60-.96           .77      .115
Verbal (V)                         32                .54-.96           .67      .116
V+R+M+Sp/M                          7                .64-.96           .79       -
V+Q+Sp/M                            8                .60-.88           .70      .103
                                             Mean across test types = .726

III. Restriction of rangec

Criterion type   Number of values   Range of values   Mean value   Standard deviation
Proficiency             17             3.44-8.98          6.12            1.21
Training                18             3.44-8.98          5.94            1.17

a Since these are speeded tests, only reported reliabilities were used; the number of items was not used to estimate "missing" reliabilities.
b There were too few training criterion reliabilities to permit construction of a sample-based artifact distribution.
c Values shown relative to an unrestricted standard deviation value of 10.0.
The fact that there were numerous reported criterion reliabilities for the job proficiency validity coefficients, however, enabled us to construct separate criterion reliability distributions for different test types. There was no theoretical reason for test type to affect criterion reliabilities, but this approach was the closest approximation permitted by the data to the ideal of individual correction of coefficients. The proficiency criteria were, nearly without exception, supervisory ratings of performance. The reported criterion reliabilities, the type of reliability, and the interval between ratings were recorded. Methodologically superior studies reported criterion reliability estimates more often than did other studies. The reported reliability was often not the appropriate type (interrater reliability); internal consistency reliabilities were not used in estimating criterion reliability distributions. The remaining estimates were of interrater agreement or of indeterminate type. When a criterion reliability value was reported as the reliability of a supervisory rating, with no information about number of raters or type of reliability, we included the coefficient as reported, with no correction, as the estimate of reliability. Some of these coefficients were probably intra-rater reliabilities, and so this decision rule probably produced overestimates of actual reliability in some cases. For studies not reporting any estimate of criterion reliability, a value of .60 was assigned, based on King, Hunter, and Schmidt's (1980) meta-analytic estimate of average interrater reliability for one rater. In these studies, it was generally either stated or implied that a single rater was used. Summaries of the resulting distributions of criterion reliabilities for the proficiency data, by test type, are presented in Table 6.

Range restriction. Levels of range restriction were indexed as s/S, where s = the restricted predictor standard deviation and S = the unrestricted standard deviation. In some studies in which range restriction values were not given, validity coefficients were presented
uncorrected and corrected for restriction of range. In such cases, the ratio s/S was recovered by solving Thorndike's (1949) Case II equation for s/S. Because usable range restriction data were infrequently reported in the validity studies, it was not possible to construct different distributions by test type. However, as was the case with criterion reliability, there is no theoretical reason to expect that range restriction values would vary by test type (or by criterion type). But we were able to construct separate range restriction distributions for the two criterion types, making our artifact distributions as specific as possible to the data they were applied to. The two resulting distributions were very similar. The ratio was multiplied by ten to standardize all values to an unrestricted SD of 10, and all values were assigned equal frequencies. Summaries of the resulting range restriction distributions are shown in Table 6. In many of the studies presenting no usable range restriction information, the authors noted that range restriction was extreme. Thus the values in Table 6 probably underestimate the true extent of range restriction.

The Schmidt-Hunter interactive validity generalization procedure (Schmidt, Gast-Rosenberg, & Hunter, 1980) was applied to the data for both assumed and sample-based artifact distributions.

Results

Results for Training Criteria

Situational specificity. Table 7 presents the results for the situational specificity hypothesis for validities against training performance criteria. The first four columns of the table contain, respectively, the total sample size, the number of validity coefficients on which each distribution was based, the uncorrected (i.e., observed) mean validity, and the observed standard deviation (SD) of validities in the distribution. The predicted SD is the standard deviation predicted solely on the basis of the four artifacts corrected for (sampling error and between-study differences in test unreliability, criterion unreliability, and degree of range restriction). The next column reports the percentage of observed variance that is accounted for by these artifacts. The column headed Residual SD presents the square root of the variance that remains after the variance attributed to the four artifacts is removed from the observed variance. This is the standard deviation of observed validities that would be expected if (1) sample size were infinite in each study, that is, there were no sampling error, and (2) test unreliability, criterion unreliability, and range restriction were all held constant at their mean values.
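The interactive procedure itself is not reproduced in this article. The following simplified, non-interactive sketch in Python shows how the quantities reported in Tables 7 through 11 relate to one another: the variance predicted from artifacts, the percentage of observed variance accounted for, the residual SD, the mean true validity re-expressed for a test reliability of .80, SDrho, the 90% credibility value, and the percentage of situations in which validity exceeds a threshold. The helper for recovering u = s/S from an uncorrected/corrected pair of validities follows the Thorndike Case II relation mentioned above. Function names and default artifact values are illustrative assumptions; the published results were produced with the full interactive procedure and complete artifact distributions, so this sketch will only approximate them.

```python
from math import sqrt
from statistics import NormalDist


def u_from_corrected(r_obs: float, r_corr: float) -> float:
    """Recover u = s/S from a validity reported both uncorrected (r_obs) and
    corrected for range restriction (r_corr): Thorndike's Case II solved for u."""
    return (r_obs * sqrt(1.0 - r_corr ** 2)) / (r_corr * sqrt(1.0 - r_obs ** 2))


def correct_for_range_restriction(r: float, u: float) -> float:
    """Thorndike Case II correction of a restricted correlation r, with u = s/S."""
    big_u = 1.0 / u
    return big_u * r / sqrt((big_u ** 2 - 1.0) * r ** 2 + 1.0)


def meta_analyze(rs, ns, mean_rxx=0.80, mean_ryy=0.60, mean_u=0.5945,
                 report_rxx=0.80, threshold=0.10):
    """Bare-bones artifact-distribution meta-analysis (simplified,
    non-interactive variant of the Schmidt-Hunter procedure)."""
    k, total_n = len(rs), sum(ns)
    r_bar = sum(r * n for r, n in zip(rs, ns)) / total_n        # mean observed validity
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Variance expected from sampling error alone, at the average sample size.
    # (A full analysis also adds variance due to between-study differences in
    #  test reliability, criterion reliability, and range restriction.)
    var_pred = (1.0 - r_bar ** 2) ** 2 / (total_n / k - 1.0)
    pct_accounted = 100.0 if var_obs <= var_pred else 100.0 * var_pred / var_obs
    resid_sd = sqrt(max(var_obs - var_pred, 0.0))
    # Correct the mean for criterion and test unreliability and for range
    # restriction at their mean values, then re-attenuate to a test
    # reliability of .80, as the paper does when reporting true validities.
    r_fully_corrected = correct_for_range_restriction(
        r_bar / sqrt(mean_ryy * mean_rxx), mean_u)
    rho = r_fully_corrected * sqrt(report_rxx)
    sd_rho = resid_sd * (rho / r_bar) if r_bar else 0.0   # scale by the correction ratio
    cv_90 = rho - 1.28 * sd_rho                           # 90% credibility value
    if sd_rho > 0:
        pct_above = 100.0 * (1.0 - NormalDist(rho, sd_rho).cdf(threshold))
    else:
        pct_above = 100.0 if rho >= threshold else 0.0
    return {"mean_r": r_bar, "obs_sd": sqrt(var_obs), "pred_sd": sqrt(var_pred),
            "pct_accounted": pct_accounted, "resid_sd": resid_sd,
            "mean_rho": rho, "sd_rho": sd_rho, "cv_90": cv_90,
            "pct_above_threshold": pct_above}
```

Applied, for example, to the six memory validities against training criteria with the assumed mean artifact values, this yields a mean true validity and credibility value in the neighborhood of the Table 8 entries; exact agreement is not expected, because the interactive procedure works with the full artifact distributions rather than only their means.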
TABLE 7
Situational Specificity Hypothesis Results for Police and Detectives in Public Service (Training Performance Criteria), Using Assumed and Sample-Based Artifact Distributions

                                                            Assumed artifact dist.              Sample-based artifact dist.
Test type          Total N   No. of r's   Mean r   Obs. SD   Pred. SD   % var.     Resid. SD    Pred. SD   % var.     Resid. SD
                                                                        acct. fora                         acct. fora
Memory (M)            801        6         .223     .079      .092       100           0          .092       100           0
Quantitative (Q)    1,206        9         .343     .149      .098        43          .112        .103        48          .107
Reasoning (R)       4,374       24         .331     .120      .090        56          .079        .103        75          .060
Sp/M                1,422       13         .275     .059      .103       100           0          .103       100           0
Verbal (V)          3,943       26         .369     .161      .097        36          .129        .109        45          .119
V+R                 1,151        7         .478     .046      .103       100           0          .196       100           0
Total/Means        12,897       85         .337     .102      .097        72.5        .053        .117        79          .048

a Where predicted variance exceeded the observed variance, the percentage of variance accounted for is shown as 100%.
Let us first consider the results in Table 7 using assumed artifact distributions. Using the 75% rule, the situational specificity hypothesis can be rejected for the memory and spatial/mechanical test types and the verbal plus reasoning composite. For these test types, validities are concluded not to vary from situation to situation, and the best estimate of the true validity of these test types for law enforcement training performance is the mean true validity. When those cases in which the predicted variance exceeds the observed variance due to second-order sampling error (Schmidt, Hunter, Pearlman, & Hirsh, 1985) are rounded down to 100% of variance accounted for, the average percentage of variance accounted for across all distributions (Table 7) is 72.5%. This is very close to the average amount of variance accounted for reported in other validity generalization studies. If these figures are not rounded down to 100%, the mean (weighted by sample size) is 121%, which is rounded down to 100%. This is the more accurate estimate required for a second-order meta-analysis (cf. Schmidt et al., 1985, Question and Answer 25).

For results in Table 7 obtained using the sample-based artifact distributions, it can be seen that under the 75% rule, the hypothesis of situational specificity can be rejected for four test types: memory, reasoning, spatial/mechanical, and the verbal plus reasoning composite. Three of these are the same test types for which the situational specificity hypothesis was rejected using the original artifact distributions; the fourth, reasoning, is added because use of the sample-based
artifact distributions attributes an additional 19% of the observed variance to artifacts. The average percentage of variance accounted for was 79%, which is 6.5 percentage points higher than when the original assumed artifact distributions are used. The sample-size-weighted percentage of variance accounted for, if figures are not rounded down to 100%, is 158% (which again is rounded down to 100%). Thus the two sets of artifact distributions produce similar results on the whole, with the original Schmidt-Hunter distributions yielding somewhat more conservative results.

Validity generalization. The distributions of estimated true validities are presented in Table 8. The mean rho of each distribution is the only parameter of interest for those test types for which the hypothesis of situational specificity was rejected. For the assumed artifact distribution, these are .49 for tests of spatial/mechanical ability, .40 for memory, and .75 for the verbal plus reasoning composite. These values are very similar when the study-based artifact distributions are used.

Table 8 presents evidence for the generalizability of validities where the hypothesis of situational specificity could not be rejected. For studies using a criterion of performance in training programs, the critical data are contained in the columns of Table 8, which report the mean, standard deviation, and 90% credibility value of the corrected validity distributions, based on both assumed and sample-based artifact distributions. The results indicate that validity is generalizable in every case. The results for the assumed artifact distributions show that the lowest 90% credibility value was .34 and the highest was .75; for the sample-based artifact distribution the lowest was .38 and the highest was .71. The best estimate of the validity is not, however, the 90% credibility value, but rather the mean of the true validity distribution.

In correcting the residual distribution to produce the distribution of estimated true validities based on sample-derived artifacts, we attenuated the fully corrected mean correlation (i.e., the mean validity corrected for range restriction, criterion unreliability, and test unreliability using average values of each) not by the square root of the mean observed test reliability value, but rather by the square root of .80. Attenuating each fully corrected mean by the square root of average test unreliability would have produced a true validity distribution tailored to tests with a reliability equal to the average reliability in our sample. Our average test reliabilities are somewhat lower than those usually associated with commercially produced tests of typical length,
TABLE 8
Validity Generalization Results for Training Performance Distributions, Police and Detectives in Public Service

                                          Distribution of estimated true validities
                                        Assumed artifacts            Sample-based artifacts
Test type         Total N   No. of r's   Mean rho  SDrho  90% C.V.   Mean rho  SDrho  90% C.V.
Memory (M)           801        6           .40      0       .40        .41      0       .41
Quantitative (Q)   1,206        9           .59     .20      .34        .64     .21      .37
Reasoning (R)      4,374       24           .57     .14      .39        .61     .11      .47
Sp/M               1,422       13           .49      0       .49        .50      0       .50
Verbal (V)         3,943       26           .62     .22      .34        .63     .19      .38
V+R                1,151        7           .75      0       .75        .71      0       .71
making it more appropriate to attenuate by the square root of .80; .80 is more representative of the reliabilities of well constructed commercial tests. Thus the true validity distributions, means, standard deviations, and 90% credibility values presented are for tests with reliabilities of .80 (for both the sample-based artifact distributions and the assumed artifact distributions).

Comparing the distributions of estimated true validities and standard deviations based on sample-derived versus assumed artifact distributions, it can be seen that for five of the six test types that could be analyzed both ways, use of the sample-based artifacts produced slightly higher estimated mean true validities and 90% credibility values. The standard deviation of true validities increased slightly for the quantitative test type and dropped slightly for the verbal and reasoning test types.

File-drawer analyses. Because of the relatively small number of studies in our distributions (we now regard a combined sample size in the hundreds or thousands, pooled across six or more studies, as small, when only ten years ago a sample size of 30 in one study was regarded as sufficient for validation purposes), we conducted a so-called file-drawer analysis (Pearlman, 1982; Rosenthal, 1979). As originally developed by Rosenthal, this procedure allows one to ascertain the number of missing studies averaging zero validity needed to bring the significance level of the mean observed r for each distribution to .05. However, because personnel psychologists are interested in practical rather than statistical significance, an alternative procedure suggested by Pearlman (1982) appears more appropriate for validity studies. Pearlman suggests computing the number of missing studies averaging zero validity that would be required to bring the mean observed validity down to a specified value of interest, rather than to the "just significant" level. Using the formula he provides, we computed this for a validity value of .10 for our distributions. Results are shown in Table 9.
TABLE 9
Number of Null Result Studies Needed to Reduce Mean Observed r of Present Validity Distributions (Training Criteria, Assumed Artifact Distributions) to .10

Test type                       Number of studies in        Number of file-drawer studies
                                the validity distribution   needed to reduce validity to .10
Memory (M)                                6                              7
Quantitative (Q)                          9                             22
Reasoning (R)                            21                             48
Spatial/Mechanical (Sp/M)                12                             23
Verbal (V)                               26                             70
Verbal + Reasoning (V+R)                  7                             26
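The entries in Table 9 are consistent with the usual file-drawer computation implied by the procedure described above: if k located studies have mean observed validity r-bar, the number of missing studies averaging zero validity needed to pull the mean down to a criterion value r_c is

\[ k_0 = k \, \frac{\bar{r} - r_c}{r_c}. \]

For example, for the memory distribution (k = 6, mean observed r = .223, r_c = .10), k_0 = 6(.223 - .10)/.10, or about 7, which matches the tabled value.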
The number of unretrievable null studies required to drop the validity to .10 is generally substantial, and it appears unlikely that the necessary number of studies exist. The one possible exception is the memory test type, where adding only 7 null studies would result in an average validity of .10.

Results for Job Performance Criteria

Situational specificity. Results for the hypothesis of situational specificity for data based on job proficiency criteria are presented in Table 10. For assumed artifact distributions, using the 75% rule, the situational specificity hypothesis is rejected for the reasoning and spatial/mechanical test types and the composite test type that included a noncognitive component (human relations and/or driving skills). Because of this, the only parameter of concern for these test types is the mean of each true validity distribution. The average percentage of variance accounted for across all cognitive test type distributions is 67% when figures larger than 100% are rounded down to 100%. This is lower than both the average for training success in this study and the average found in other validity generalization research. If the actually computed percentages are used, and weighted by the sample size, the average percentage of variance accounted for rises to 79%. This figure is consistent with the results of other validity generalization studies.

The results using sample-based artifact distributions showed only slight differences from those obtained with the assumed artifact distributions, but these differences were uniformly in one direction. In the four cases when the two sets of results differed, the SD predicted from the sample-based artifacts was always slightly lower, although never
TABLE 10
Situational Specificity Hypothesis Results for Police and Detectives in Public Service (Performance Criteria), Using Assumed and Sample-Based Artifact Distributions

                                                                  Assumed artifact dist.              Sample-based artifact dist.
Test type              Total N   No. of r's   Mean r   Obs. SD    Pred. SD   % var.     Resid. SD     Pred. SD   % var.     Resid. SD
                                                                             acct. fora                          acct. fora
Cognitive:
Memory (M)               3,028      25         .051     .11        .091        64         .067         .090        64         .068
Psychomotor (P)          1,029       9         .078     .114       .094        69         .063         .093        66         .066
Quantitative (Q)         1,188       8         .130     .127       .085        45         .094         .084        44         .095
Reasoning (R)            3,175      29         .081     .101       .096        90         .033         .096        90         .033
Sp/M                     3,536      29         .087     .083       .092       100         .00          .092       100         .00
Verbal (V)               2,207      18         .089     .114       .091        63         .069         .091        63         .069
V+R+M+Sp/M                 828       7         .124     .157       .093        35         .126         .091        34         .127
Total/Means             14,991     125         .091     .116       .092        66.57      .064         .091        65.85      .065
Composite including noncognitive component:
V+M+R+D+Human
  Relationsb               868      15         .145     .139       .134        92         .039         Not available

a Where predicted variance exceeded the observed variance, the percentage of variance accounted for is shown as 100%.
b No sample-based artifact information was available for this test type.
by more than .002. In the three cases where the percentage of variance accounted for by the artifacts differed, it was always lower in the sample-based artifact results, by 1-3 percentage points. The residual standard deviation changed in five cases with the set of artifacts used: it was from .001 to .003 higher for the sample-based artifact results. The average differed by only .001. These differences are negligible and are probably due to the generally higher criterion reliability values in the sample-based artifact distribution. Possible problems with the criterion reliability values are discussed later in this paper.

Validity generalization. The true validity distributions for the police and detectives in public service category are shown in Table 11. Considering first the results using the assumed artifact distributions, the estimated mean true validities for all test types are positive and greater than .10, which has been used elsewhere in this study as a minimum level of useful validity. (As with the training data, they were computed using a test reliability value of .80.) This parameter, the estimated mean true validity, is the best estimate of true validity in a new setting involving the same job and test type. For the reasoning and spatial/mechanical test types, as well as for the composite with a noncognitive component, this is, in fact, the only parameter of
TABLE 11
Validity Generalization Results for Job Performance Distributions, Police and Detectives in Public Service

                                              Distribution of estimated true validities
                                    Assumed artifacts                              Sample-based artifacts
                                                       % of coeff.                                   % of coeff.
                                                       with rho at or above                          with rho at or above
Test type        Total N  No. of r's  Mean rho  SDrho  90% C.V.   .01    .10      Mean rho  SDrho  90% C.V.   .01    .10
Memory (M)         3,028     25         .11      .14     -.07      75     53         .10      .13     -.07      75     49
Psychomotor (P)    1,029      9         .17      .14     -.01      81     69         .14      .12     -.01      86     63
Quantitative (Q)   1,188      8         .28      .20      .02      91     82         .26      .18      .03      92     82
Reasoning (R)      3,175     29         .18      .07      .09      99     86         .17      .07      .08      99     84
Sp/M               3,536     29         .19      .00      .19     100    100         .17      .00      .17     100    100
Verbal (V)         2,207     18         .22      .15      .03      90     78         .18      .14      .00      89     72
V+R+M+Sp/M           828      7         .27      .27     -.08      83     73         .22      .22     -.08      82     70
V+M+R+D+Human
  Relationsa         868     15         .31      .09      .20      99     99         Not available

a No sample-based artifact information was available for this test type.
interest because, using the 75% rule, we concluded that the validities of these test types do not vary from situation to situation. Thus, these test types have fully generalizable validities of .18, .19, and .31, respectively, using a job performance criterion. If the far more conservative 90% credibility value is used to assess the generalizability of validities of the remaining test types, none can be said to have a useful, generalizable level of validity. As Table 11 shows, the 90% credibility values are below .10 in all cases where that value is the relevant parameter.

A third criterion for evaluating validity generalizability is the one proposed by Callender and Osburn (1981). By direct analogy with procedures used in performing traditional tests of statistical significance, they have suggested that a conclusion of validity generalization should be reached whenever the 90% credibility value is positive. A significant result using traditional significance tests means only that the confidence interval does not include zero. Using this standard, we would conclude that tests of quantitative ability have validities that may be generalized, with 90% confidence, to a new situation.

In order to get a fine-grained picture of these validity distributions, we extended the analysis slightly. For each test type, we computed the points in the estimated distribution of true validities at which validity reached .01 and .10. The results of this analysis are presented in Table 11. It can be seen that the percentage of cases in which validity can be expected to reach or exceed .10 varies from a high of 86% for the reasoning test type to a low of 53% for the memory test type, while the percentage of cases in which validity is expected to reach or exceed .01 ranges from 91% for tests of quantitative ability to 75% for memory. Thus, for at least some test types, positive (albeit small) validities are found to exist in a large majority of situations. Nevertheless, the level of mean estimated true validities and the 90% credibility values are considerably lower than for the performance results from other validity generalization studies. Possible reasons for this are discussed later in this paper.

Turning to the validity generalization results using the sample-based artifact distributions, the most important result is that conclusions about the generalizability of validities are in no case different from those reached using the assumed artifact distributions. In most cases, however, the estimated true validities and 90% credibility values were slightly lower when calculated using the sample-based artifact distributions. In a few cases, there was no difference between the results based on one set of distributions and those based on the other set. The small magnitude of the differences as well as the fact that the
416
PERSONNEL PSYCHOLOGY
same decision about situational specificity and validity generalization is reached with both sets strengthen the conclusion that the assumed artifact distributions produce results comparable to those obtained with artifact distributions drawn from the sample of studies being analyzed.
Analyses Pooled Across All Law Enforcement Occupations

Similar analyses to those reported above, but based on validity coefficients pooled across all five law enforcement occupations, yielded essentially identical results with respect to both situational specificity and validity generalizability.2 These findings suggest that the variation in validities across all law enforcement occupations is no greater than that within narrower single occupational groupings. Because only a small number of coefficients were added in the "pooled" analyses (the maximum added for a single test type was 14), however, this conclusion is tentative.

2 More detailed information is available in Validity Generalization Results for Law Enforcement Occupations (Hirsh, H. R., Northrop, L. C., & Schmidt, F. L., 1985), an Office of Personnel Management technical report.
Discussion

The results of the present study indicate that measures of cognitive ability correlate with performance in job training for law enforcement personnel at about the same levels found in validity generalization studies for other occupations. The correlations for performance on the job, however, were smaller than those typically found for other occupations. This was true whether the assumed distributions of artifacts or those developed from sample data were used.

The lower validities may be due in large part to the difficulties involved in developing a good criterion measure of job performance for police work. People in law enforcement occupations often work either alone or with only a partner. Thus, opportunities for supervisors to observe their behavior are limited. This fact of police life has been used (Eyde, Primoff, & Hardt, 1983) to support the proposition that the criterion-related strategy may not be an appropriate method for establishing validity in this occupation. Barrett (cited in Kirkland v. New York State Department of Correctional Services, 1974) has also suggested that when people work independently, it is not possible to get good criterion ratings because no one regularly sees what the incumbent is doing. The high average criterion reliabilities do not contradict this hypothesis, since a reliable criterion may be contaminated or deficient. In short,
the ratings criteria in the present data may have been relatively poor measures of job performance; use of a more valid performance measure might have indicated that cognitive abilities tests are as valid for police work as for other occupations.

Another hypothesis for the low validities associated with job performance is that personality variables or interpersonal skills play a large role in determining proficiency as a police officer or detective. This hypothesis is consistent with the fact that the strongest correlate of job performance (.31) was the composite that included human relations skill and driving ability. However, it cannot be determined from the present data whether this is reliably higher than the correlation with job performance (.27) for the two solely cognitive composite test types. Other validity generalization results have shown a weaker than typical correlation between measures of cognitive ability and job performance for sales clerk occupations, another area in which personality variables have been hypothesized to be important for job success (Schmidt, Hunter, Pearlman, & Caplan, 1980). In our search for validity studies we encountered a substantial number of studies relating noncognitive variables to law enforcement jobs. These studies could shed some light on the role of noncognitive factors in determining law enforcement job performance.

Even if the low job performance validities are taken at face value, however, it should not be concluded that cognitive ability testing has no place in police and detective selection procedures. Cognitive ability tests are highly valid for success in job training programs, and there is evidence that job knowledge of the type acquired in training programs plays a strong causal role in determining job performance capabilities (Hunter, 1983; Schmidt & Hunter, in press).

The problems with our data may extend beyond questions of observability of job performance and the role of noncognitive traits. This hypothesis is predicated on a comparison of the present findings with Ghiselli's (1973) mean reported observed validities of various test types for protective service work. The protective service job family is, as mentioned earlier, highly similar to our law enforcement category. Ghiselli found mean observed correlations ranging between .14 and .23, while ours for roughly analogous categories ranged from .10 to .16. Although this comparison is not precise, because we did not use exactly the same test categories as he did, the categories are similar enough to make the comparison informative. If, for example, one views our two cognitive composites (verbal plus reasoning plus memory plus spatial/mechanical ability, and verbal plus quantitative plus spatial/mechanical ability) as measures of general intelligence, then our
mean observed validities of .12 and .16, respectively, can be compared with the mean validity of .23 reported by Ghiselli. If we can equate our quantitative ability with his numerical ability, the observed mean validities are .15 (ours) and .18 (his). Ghiselli's validities for spatial and mechanical abilities are presented separately as .17 and .23; our finding for a combined spatial/mechanical ability was .10. The observed mean validities for perceptual speed are .14 (ours) versus .21 (his), while for psychomotor ability his mean is .14 and ours is .12. On the other hand, even though Ghiselli's means are higher than ours, they are near the lower end of mean observed performance validities in the range of occupations he surveyed.

Taken together, these findings suggest that factors other than suboptimal job performance ratings may be operating to reduce the validity of cognitive aptitudes for prediction of law enforcement job performance in our study. The reason that both Ghiselli's and our reported validities for protective service occupations are lower than those for many other occupations may be the lack of good performance measures for this group. If this is so, the differences between his results and ours could not be due to the observability problem; they must be attributed to another factor. We are uncertain of what this factor could be. It may be a difference in the mean reliability of the criteria, although no basis exists for hypothesizing such a difference.

The differences in values between the assumed and sample-based artifact distributions were generally small. The range restriction values derived from information in the sample are very close to those assumed by Schmidt and Hunter. A comparison (Tables 4 and 6) shows that the assumed mean standard deviation value is 5.945 (relative to an unrestricted value of 10.0), while the mean values in our sample were 5.94 and 6.12 for training and job performance criteria, respectively. The average value for predictor reliability used by Schmidt and Hunter is .80 (Table 3). This is almost exactly the sample-based average (.803) across all test types for studies using a proficiency criterion. For studies using a training criterion, the average predictor reliability across all test types in our study was .757, which is somewhat lower than the Schmidt and Hunter assumed average of .80. Based on these data, we conclude that the Schmidt and Hunter range restriction and predictor reliability distributions are generally accurate and that they can be used with confidence when sample information is not available.

The greatest discrepancy between assumed and sample-based distributions occurs for the performance criterion reliability values. (It may be recalled that there were insufficient training criterion reliability values
to develop a sample-based distribution.) The average values of these were uniformly higher in the sample-based distributions than the average value of .60 for the assumed distribution. The range for individual test types was .67 to .79; the average across test types was .73. Although this difference seems substantial, the impact on important parameters of the true validity distribution is quite small. Differences in estimated mean true validities and 90% credibility values hovered about .02; conclusions about situational specificity and validity generalizability were unchanged. Thus, for operational purposes, it appears that even moderate differences in the estimated mean of criterion reliabilities do not lead to different conclusions. Also, the conservativeness with which we treated criterion reliabilities makes it likely that our distributions, rather than Schmidt and Hunter's, were the inaccurate ones. For example, in several cases in the studies the type of reliability was unidentified. If these were estimates of internal consistency or intrarater agreement, they could be expected to be overestimates of the appropriate type of reliability (interrater reliability). We made no downward adjustments in our reliability estimates to account for this. It would be hard to contend that the sample-based reliability values are better estimates of actual mean criterion reliability values than those based on careful study of the literature as a whole (King, Hunter, & Schmidt, 1980).

In summary, this study demonstrates that cognitive ability tests are excellent predictors of performance in job training programs for police and detectives and that the use of assumed rather than sample-based distributions of artifacts makes little or no difference in results and conclusions. We can also conclude that several types of cognitive tests have at least a minimally useful level of validity for predicting job performance of law enforcement personnel in a majority of situations. There are uncertainties about the degree of validity of the job performance criteria, however, and actual validity may be substantially higher. To resolve these uncertainties, we recommend that additional validity studies be conducted on law enforcement occupations and that these studies be incorporated into future meta-analyses.

REFERENCES

Callender JC, Osburn HG. (1981). Testing the constancy of validity with computer-generated sampling distributions of the multiplicative variance estimate: Results for petroleum industry validation research. Journal of Applied Psychology, 66, 274-281.
Campbell JP. (1982). Editorial: Some remarks from the outgoing editor. Journal of Applied Psychology, 67, 691-700.
Ekstrom RB, French JW, Harman HH. (1976). Manual for kit of factor-referenced cognitive tests. Princeton, NJ: Educational Testing Service.
Eyde LD, Primoff ES, Hardt RH. (1983). What should the content of content validity be? (OPRD 83-5). U.S. Office of Personnel Management, Office of Personnel Research and Development. (NTIS PB84 118 017)
Ghiselli EE. (1973). The validity of occupational aptitude tests in personnel selection. PERSONNEL PSYCHOLOGY, 26, 461-471.
Hirsh HR, Northrop LC, Schmidt FL. (1985). Validity generalization results for law enforcement occupations. U.S. Office of Personnel Management, Office of Staffing Policy.
Hunter JE. (1983). A causal analysis of cognitive ability, job knowledge, job performance, and supervisory ratings. In Landy F, Zedeck S, Cleveland J (Eds.), Performance measurement and theory (pp. 257-266). Hillsdale, NJ: Erlbaum.
King LM, Hunter JE, Schmidt FL. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 502-516.
Kirkland et al. v. New York State Department of Correctional Services, Barrett Testimony. 374 F. Supp. 1361 (1974).
Lent RH, Auerbach HA, Levin LS. (1971). Predictors, criteria, and significant results. PERSONNEL PSYCHOLOGY, 24, 519-533.
Pearlman K. (1979). The validity of tests used to select clerical personnel: A comprehensive summary and evaluation (TS-79-1). Washington, DC: U.S. Office of Personnel Management, Office of Personnel Research and Development Center.
Pearlman K. (1982). The Bayesian approach to validity generalization: A systematic examination of the robustness of procedures and conclusions. Unpublished doctoral dissertation, The George Washington University, Washington, DC.
Pearlman K, Schmidt FL, Hunter JE. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.
Rosenthal R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86, 638-641.
Schmidt FL, Gast-Rosenberg I, Hunter JE. (1980). Validity generalization results for computer programmers. Journal of Applied Psychology, 65, 643-661.
Schmidt FL, Hunter JE. (1977). Development of a general solution to the problem of validity generalization. Journal of Applied Psychology, 62, 529-540.
Schmidt FL, Hunter JE. (in press). The impact of job experience and ability on job knowledge, work sample performance, and supervisory ratings of job performance. Journal of Applied Psychology.
Schmidt FL, Hunter JE, Pearlman K, Caplan JR. (1980). Validity generalization results for three occupations in the Sears Roebuck Company. Unpublished document.
Schmidt FL, Hunter JE, Pearlman K, Hirsh HR. (1985). Forty questions and answers about validity generalization and meta-analysis. PERSONNEL PSYCHOLOGY, 38, 697-798.
Thorndike RL. (1949). Personnel selection. New York: Wiley.
U.S. Department of Labor. (1977). Dictionary of occupational titles (4th ed.). Washington, DC: U.S. Government Printing Office.