Journal of Applied Psychology 1977, Vol. 62, No. 3, 278-282

Behaviorally Anchored Rating Scales: Effects of Education and Job Experience of Raters and Ratees

Wayne F. Cascio and Enzo R. Valenzi
Florida International University

The effects of two levels of rater and ratee experience and education, as well as their possible interaction, on behaviorally anchored rating scales were considered. A total of 370 male police personnel participated, of whom 71 were sergeants and 299 were police officers. Eight dependent variables, each a 9-point behaviorally anchored rating scale describing one dimension of police officer performance, were subjected to fixed-effects, unweighted-means analyses of variance. Results indicated that raters' experience and raters' education accounted for most of the statistically significant effects. Likewise, Raters' Experience X Education and Raters' Education X Ratees' Education interactions were statistically significant. All significant effects were weak, however, as indicated by overlaps of 82%-92% between distributions, and eta-squares for all significant F ratios of .01-.03. Hence, neither rater nor ratee characteristics exerted any practically significant effects on observed behaviorally anchored ratings.

Although many different techniques, procedures, and formats are available for appraising employee performance (Barrett, 1966), behaviorally anchored rating scales (BARS) have been widely recommended (Campbell, Dunnette, Arvey, & Hellervik, 1973; Campbell, Dunnette, Lawler, & Weick, 1970; Dunnette, 1966). In their review of BARS research, however, Schwab, Heneman, and DeCotiis (1975) concluded that there is little reason to believe that BARS are superior to alternative evaluation instruments in terms of a reduction in leniency or halo effects. Schwab et al. speculated that such results may be due to differences in BARS development procedures (subsequently confirmed by Bernardin, LaShells, Smith, & Alvares, 1976), but future evaluation research must move beyond manipulating only the characteristics of the instrument itself. Other sources of evaluation score variance must be considered. Likewise, Zedeck, Jacobs, and Kafry (1976) called for an examination of the individual differences associated with perceptions of behaviors. No such evidence exists for BARS, although investigations of individual differences in raters or in ratees have been reported for other evaluation procedures. Thus Mandell (1956), Schneider and Bayroff (1953), and Rothe (1949) reported that individual difference variables associated with raters and ratees, such as tenure, activities and achievements, and age, influenced the severity, leniency, and validity of performance ratings. The present study was designed to fill this void in BARS research by considering the effects of two individual difference variables, education and experience of both raters and ratees (as well as their possible interaction). Based on previous research and in-depth interviews, the following three hypotheses were investigated:

1. Less experienced raters will be more severe in their ratings than more experienced raters.

2. Highly educated raters in general will be more severe in their ratings than less educated raters, regardless of ratee characteristics.

This research was conducted with the support of Police Foundation Grant 73-12 to the Dade County Public Safety Department, Miami, Florida. The authors would like to acknowledge the cooperation and assistance of Leslie Real, James Bryant, and Hermio Sanchez throughout all phases of the research. Requests for reprints should be sent to Wayne F. Cascio, Florida International University, Tamiami Trail, Miami, Florida 33199.


Table 1
Behaviorally Anchored Rating Scales and Definitions

Job Knowledge: Awareness of procedures, law, and court rulings, and changes in them.
Judgment: Observation and assessment of the situation and taking appropriate action.
Initiative: Individual personal performance conducted without either direct supervision or commands, including suggestions for improved departmental procedures.
Dependability: Predictable job behaviors, including attendance, promptness, and reaction to boredom, stress, and criticism.
Demeanor: Professional bearing as determined by overall neatness of uniform, personal grooming, and general physical condition.
Attitude: General orientation toward the law enforcement profession and the department.
Relations with others: Ability to deal with people contacted during the performance of patrol duties, including public, fellow officers, and supervisory personnel.
Communication: Ability to make oneself understood and to gather and transmit information, both in oral and written fashion.

3. More experienced raters will be more severe than less experienced raters in their ratings of highly educated ratees.

Method

Sample

The research was carried out in a large metropolitan police department. The total sample was composed of 370 males, of whom 71 were raters and 299 were ratees. All raters were chosen from the same organizational level (sergeants) in order to control for sources of unreliability associated with raters from different organizational levels (Borman, 1974). In the rater sample there were 69 Caucasians and three blacks. In the ratee sample (police officers) there were 245 Caucasians, 32 blacks, and 22 Latins. Although the potential sample size of raters and ratees was much larger, subjects were dropped if complete information could not be obtained on both members of a rater-ratee pair.

Design

The total sample of raters and ratees was divided into subgroups of high education (completed at least 2 years of college), low education (less than 2 years of college), high-experience raters (6 years or longer in grade), low-experience raters (less than 6 years in grade), high-experience ratees (3 or more years in grade), and low-experience ratees (less than 3 but more than 1 year in grade). The unique combinations of rater and ratee education and experience yielded a 2 X 2 X 2 X 2, or 16-cell, design. Hence, one cell in the design would be defined, for example, by a low-education rater with high experience who rated a ratee with high education and low experience.
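Crossing the four dichotomized factors generates the 16 cells described above. A minimal sketch (the variable names and "low"/"high" labels are ours, not the paper's coding):

```python
from itertools import product

# The study's four dichotomous factors; labels are illustrative only.
factors = {
    "rater_experience": ["low", "high"],
    "rater_education": ["low", "high"],
    "ratee_experience": ["low", "high"],
    "ratee_education": ["low", "high"],
}

# Fully crossing four two-level factors yields the 16-cell design.
cells = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(cells))  # 16
```

One cell, for example, is the combination {rater_experience: "high", rater_education: "low", ratee_experience: "low", ratee_education: "high"}, matching the example given in the text.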

Education and experience were chosen because, until several years ago, college-educated police officers were rare; most had high school or associate degrees. As the police officer's role has become more complex, however, there has been increasing pressure to recruit and select college-educated officers, and there is evidence that they perform more effectively (Cascio, 1977). The cutoffs (6 years and 3 years, respectively) used to dichotomize high- from low-experience raters and ratees were set on the basis of in-depth interviews with operating-line personnel from several organizational levels. To check whether ratee assignments were made systematically according to education and age (e.g., high- and low-education ratees supervised by high- and low-education raters), two chi-square analyses were performed, one for education and one for age. Neither was significant, indicating that the assignment of ratees to raters was haphazard with respect to the education and age variables. There were eight dependent variables, corresponding to the eight BARS dimensions rated. Each dependent variable was a rating on a 9-point BARS. The design therefore permitted the partitioning of variance in the ratings into nonoverlapping components attributable to four main effects and all their interactions.
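The assignment check above is a chi-square test of independence on a 2 x 2 contingency table. A sketch of that computation, with cell counts invented purely for illustration (the paper does not report them):

```python
# Hypothetical counts: rows are rater education (low, high), columns are
# ratee education (low, high). These numbers are NOT from the paper.
observed = [[80, 70],
            [75, 74]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square statistic: sum of (O - E)^2 / E over the four cells.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed[i][j] - expected) ** 2 / expected

# Critical value for df = 1 at alpha = .05 is 3.841; a statistic below it
# gives no evidence of systematic assignment (the paper's conclusion).
print(chi2 < 3.841)  # True
```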

Dependent Variables

BARS for police officers were developed within the organization from which the present sample was drawn, in the course of a 2-year research program on police officer selection and performance appraisal. Standard BARS development procedures were followed (Smith & Kendall, 1963), including the retranslation technique, to establish the relevance of the eight BARS developed for measuring police officer performance. The scale names and definitions are presented in Table 1. Interrater reliabilities in the present organization were in the mid-.80s across dimensions. Further description of the developmental procedures and psychometric properties of the police officer BARS may be found in Landy and Farr (Note 1). All ratings were obtained over a 3-week interval in August 1974. Each rater rated approximately four individuals. Inspection of the frequency distribution of each of the eight BARS indicated that all were slightly negatively skewed but approximately normal.

Table 2
Intercorrelation Matrix of All Independent and Dependent Variables

Note. Variables 1-4 are rater's experience, rater's education, ratee's experience, and ratee's education; Variables 5-12 are the eight BARS (Job Knowledge, Judgment, Initiative, Dependability, Demeanor, Attitude, Relations with others, and Communications). Rater's experience and rater's education correlated .47 (p < .01). Intercorrelations among the eight BARS ranged from .84 to .91 (all p < .01); correlations between the rater and ratee characteristics and the BARS ranged from -.08 to .20. [Individual cell entries were not recoverable in this copy.]

Analysis

BARS ratings for each of the eight separate dimensions were subjected to a fixed-effects, unweighted-means analysis of variance. Because of small n's in certain cells of the complete 16-cell design, no attempt was made to interpret those results. Therefore, the 4 possible three-way analyses of variance (formed by collapsing the design across each of the independent variables one at a time) were run for each of the 8 BARS, resulting in 32 analyses of variance. To clarify further the significant main effects, Tilton's (1937) overlap statistic and eta-squares were computed. The complete intercorrelation matrix of the 4 independent and 8 dependent variables is shown in Table 2.
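The two effect-size summaries used here can be sketched in Python. Eta-squared is simply the proportion of total rating variance attributable to an effect, and Tilton's (1937) overlap is computed below under its equal-variance normal assumption; the means and SD passed in are illustrative values in the range the paper reports, not actual cell statistics.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def eta_squared(ss_effect: float, ss_total: float) -> float:
    """Proportion of total variance attributable to one effect."""
    return ss_effect / ss_total

def tilton_overlap(m1: float, m2: float, sd: float) -> float:
    """Tilton's (1937) percentage overlap of two equal-variance normal
    distributions: 100 * 2 * P(Z > |d| / 2), where d = (m1 - m2) / sd."""
    d = abs(m1 - m2) / sd
    return 100.0 * 2.0 * (1.0 - normal_cdf(d / 2.0))

# Illustrative values: group means of about 6.4 and 7.0 on a 9-point
# scale with a pooled SD of roughly 1.5 give an overlap in the 82%-92%
# band the paper reports.
print(round(tilton_overlap(6.4, 7.0, 1.5)))  # 84
print(eta_squared(3.0, 150.0))  # 0.02
```

Note how weak these effects are in practical terms: even a mean difference of 0.6 scale points leaves the two rating distributions overlapping by roughly 84%.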

Results

Despite the high intercorrelations of the dependent variables, there was sufficient unique variance within each dimension to yield differential treatment effects across the eight BARS. As can be seen in Table 3, most of the significant effects were due to the raters' experience and raters' education. Raters' experience was significant for five of eight BARS, and raters' education was significant for four of eight BARS. Ratees' experience was significant for only two of eight BARS; ratees' education failed to reach significance in any analysis. Examination of the profiles for the two significant two-way interactions suggested that both were disordinal, that is, the profiles intersected rather than diverged (Lubin, 1961). However, given that only two of the many possible interactions reached significance and that the associated eta-squares were slight (.02), we do not emphasize their importance to the overall analysis.

Table 3
Significant F Ratios for Each Three-Way Analysis of Variance Performed on the Behaviorally Anchored Rating Scales

Note. A = rater's experience; B = rater's education; C = ratee's experience; D = ratee's education. Significant F ratios ranged from 3.7 to 8.6 (p < .05 or p < .01) and eta-squares for all significant F ratios ranged from .01 to .03. The A X B and B X D interactions for the Judgment scale were significant at p < .05, eta-square = .02. [Individual cell entries were not recoverable in this copy.]

Table 4 shows the means, standard deviations, and Tilton's overlap statistic for each scale. For all five BARS yielding significant F ratios for raters' experience, the raters with high experience gave higher mean ratings. The percentage overlap for the five contrasts was uniformly high, between 86% and 88%. Similarly for raters' education, in those four BARS with significant F ratios, the raters with low education gave higher mean ratings. Again the percentage overlap was uniformly high, between 85% and 96%. Ratees' experience was significant for two BARS, and in both cases ratees with high experience received higher mean ratings. Again the percentage overlap was very high, ranging from 82% to 92%. Although Hypotheses 1 and 2 were statistically supported, none of the statistical differences between groups was practically significant. Eta-squares for all significant F ratios ranged from .01 to .03, and Tilton's overlap statistic for each pair of significantly different means varied from 82% to 92%.

Table 4
Means and Standard Deviations for Significant Main Effects for the Behaviorally Anchored Rating Scales

Note. Group means ranged from 6.2 to 7.0 and standard deviations from 1.4 to 1.7 across the low and high rater-experience, rater-education, and ratee-experience groups. Tilton's overlap statistic computed for each pair of significantly different means varied from 82% to 92%. [Individual cell entries were not recoverable in this copy.]

Discussion

In the present study, failure to reject the null hypothesis would be taken as evidence that the independent variables did not influence the BARS ratings. The weakness of the obtained significant differences therefore suggests that the ratings were not affected by the independent variables in any practically significant way.¹ This has favorable implications for BARS use in performance appraisal. As can be seen in Tables 2 and 4, however, halo and leniency effects were not removed by using BARS. These results are consistent with previous research (Schwab et al., 1975). Despite thorough involvement of raters in scale development and training in the rating procedure and potential contaminants, leniency and halo effects were still evident. One explanation for the halo effect stems from a special problem associated with measuring police performance, namely, officers are seldom directly observed by their supervisors. A priori, therefore, we might expect supervisors to form a general impression of the ratee and then proceed to rate all BARS in a manner consistent with this general impression. With regard to leniency, raters may feel that those who would have been rated unfavorably have already been discharged from the organization (Bass, 1956). Rigorous selection and training hurdles tend to eliminate potentially poor performers, thus contributing to a high standard of performance that becomes in fact the average or expected standard. Hence we may be observing the ratings in a context in which it is more appropriate to speak of a leniency effect instead of a leniency error.

A priori, one might also expect to find some degree of criterion contamination. For example, if a rater knows that the ratee is highly experienced, then there might be a bias toward higher ratings on "judgment," regardless of the ratee's true score. On the other hand, the present results indicate that there was no such systematic upgrading or downgrading because of rater or ratee characteristics.

The circumstances under which the ratings were obtained limit the generality of these results. Raters were told that the ratings were to be used for research purposes only. The combined effects of the immediately preceding training session and research use of the ratings are unknown. We hypothesize that the greater the extent to which the training "took" and raters were convinced that ratings would not be used for personnel action, the more accurate the appraisals. It is of course an open question whether ratings obtained for actual personnel use would yield different results.

¹ It is possible, of course, that factors other than those studied here affected the ratings. If there exist substantial true performance differences between high- and low-experience ratees on some dimensions that are not reflected in the ratings, then a significant ratee effect would exist. However, the positive relationships observed between ratee experience (and education) and job knowledge (which make good conceptual sense), coupled with the generally low correlations between the two ratee characteristics and performance on other dimensions, argue indirectly for assuming a high relationship between obtained and true performance scores.

Reference Note

1. Landy, F. J., & Farr, J. L. Police performance appraisal (Tech. Rep. to Law Enforcement Assistance Administration, Grant N1071-063-6). University Park: Pennsylvania State University, Department of Psychology, September 1975.

References

Barrett, R. S. Performance rating. Chicago: Science Research Associates, 1966.
Bass, B. M. Reducing leniency in merit ratings. Personnel Psychology, 1956, 9, 359-369.
Bernardin, H. J., LaShells, M. B., Smith, P. C., & Alvares, K. M. Behavioral expectation scales: Effects of developmental procedures and formats. Journal of Applied Psychology, 1976, 61, 75-79.
Borman, W. C. The rating of individuals in organizations: An alternative approach. Organizational Behavior and Human Performance, 1974, 12, 105-124.
Campbell, J. P., Dunnette, M. D., Arvey, R. D., & Hellervik, L. V. The development and evaluation of behaviorally based rating scales. Journal of Applied Psychology, 1973, 57, 15-22.
Campbell, J. P., Dunnette, M. D., Lawler, E. E., III, & Weick, K. E., Jr. Managerial behavior, performance, and effectiveness. New York: McGraw-Hill, 1970.
Cascio, W. F. Formal education and police officer performance. Journal of Police Science and Administration, 1977, 5, 89-96.
Dunnette, M. D. Personnel selection and placement. Belmont, Calif.: Wadsworth, 1966.
Lubin, A. The interpretation of significant interactions. Educational and Psychological Measurement, 1961, 21, 807-817.
Mandell, M. M. Supervisory characteristics and ratings. Personnel, 1956, 32, 435-440.
Rothe, H. F. The relation of merit ratings to length of service. Personnel Psychology, 1949, 2, 237-242.
Schneider, D. E., & Bayroff, A. G. The relationship between rater characteristics and the validity of ratings. Journal of Applied Psychology, 1953, 37, 278-280.
Schwab, D. P., Heneman, H. G., III, & DeCotiis, T. Behaviorally anchored rating scales: A review of the literature. Personnel Psychology, 1975, 28, 549-562.
Smith, P. C., & Kendall, L. M. Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 1963, 47, 149-155.
Tilton, J. W. The measurement of overlapping. Journal of Educational Psychology, 1937, 28, 656-662.
Zedeck, S., Jacobs, R., & Kafry, D. Behavioral expectations: Development of parallel forms and analysis of scale assumptions. Journal of Applied Psychology, 1976, 61, 112-115.

Received March 19, 1976