HUMAN PERFORMANCE, 13(4), 355–370 Copyright © 2000, Lawrence Erlbaum Associates, Inc.
Effects of the Rating Process on the Construct Validity of Assessment Center Dimension Evaluations

Chet Robie, Hobart G. Osburn, Mark A. Morris, and Jason M. Etchegaray
Department of Psychology, University of Houston
Kimberly A. Adams
American Institutes for Research
Assessment center research routinely finds that correlations among dimensions within exercises are stronger than correlations within dimensions across exercises (i.e., the exercise effect). A study was designed to examine whether the commonly observed exercise effect associated with assessment center ratings could be explained in terms of the rating process. One hundred undergraduate students participated in a simulated assessment center that included 2 exercises and 4 dimensions. Fourteen trained undergraduates participated as assessors. Assessment center dimensions were rated using either a within-exercise rating process in which an assessor rated all dimensions within 1 exercise or a within-dimension rating process in which an assessor rated 1 dimension across all exercises. It was hypothesized that a within-exercise rating process would result in exercise factors and a within-dimension rating process would result in dimension factors. Confirmatory factor analyses of the 2 rating processes strongly supported the hypotheses. Implications for the design of assessment centers and future research are discussed.
Requests for reprints should be sent to Chet Robie, Personnel Decisions International, 2000 Plaza VII Tower, 45 South 7th Street, Minneapolis, MN 55402–1608. E-mail: [email protected]

Assessment centers consist of a collection of behavioral exercises designed to tap constructs that have been identified via job analyses to be important to job success. Over the past several decades, assessment centers have become popular as a means of identifying and developing both managerial and nonmanagerial personnel. The technique has a high degree of utility in predicting various aspects of success with an
uncorrected mean validity of approximately .30 based on more than 100 validity coefficients representing more than 12,000 individuals (Gaugler, Rosenthal, Thornton, & Bentson, 1987). It has frequently shown an absence of adverse impact (Thornton & Byham, 1982) and has consistently shown a high degree of content validity (Norton, 1977; Sackett, 1987). Assessment centers have flourished as an important area of interest for researchers. Almost from the inception of assessment center research, however, researchers have questioned whether the typical application of the assessment center technique provides valid inferences of the constructs it purports to measure. There are several methods for generating ratings in assessment centers (Thornton, 1992). The first method, known as the behavioral reporting method, requires assessors to observe behavior in all exercises and then make overall dimension ratings. The second method, known as the exercise–dimension method, requires assessors to observe behavior in all exercises and then make dimension ratings for each exercise. The third method, known as the within-exercise method, requires assessors to make ratings for each dimension after the completion of each exercise. The fourth method, known as the across-exercise method, requires each assessor to observe each candidate’s performance in only one exercise and then make dimension ratings only for that exercise. Contrary to expectations, research utilizing the latter three methods has routinely found that correlations among dimensions within exercises are stronger than correlations within dimensions across exercises (e.g., Bycio, Alvares, & Hahn, 1987; Robertson, Gratton, & Sharpley, 1987; Sackett & Dreher, 1982; Sackett & Harris, 1988; Turnage & Muchinsky, 1982). Thus, assessment center ratings seem to best be explained by performance on individual exercises as opposed to performance on individual dimensions across different exercises (i.e., exercise factors emerge instead of dimension factors). This implies that the dimension ratings are only representative of behavior in a particular situation or exercise (i.e., they are situation specific). These findings have prompted some scholars to suggest that assessment center dimension evaluations are not construct-valid personnel tools due to their lack of discriminant validity (e.g., Schmidt, Ones, & Hunter, 1992). Several reformations of these common assessment center designs were made in attempts to ameliorate the situational specificity effect. These solutions included such tactics as reducing the number of dimensions that an assessor evaluates during the assessment center (e.g., Gaugler & Thornton, 1989), implementing behavioral checklists that categorize behaviors according to dimensions (e.g., Reilly, Henry, & Smither, 1990), or allowing an assessor to evaluate performance after each exercise as opposed to having the assessor evaluate performance after all exercises (e.g., Sackett & Dreher, 1982). However, none of these different design modifications markedly improved the discriminant validity of the assessment centers under study. It is important to realize that each solution was proposed based on the assumption that the lack of discrimination between dimensions was the result
of assessors making rating errors due to the excessive cognitive demands placed on them. The situational specificity effect also may be a result of assessors’ use of prototypes. Zedeck (1986) suggested that the situational specificity effect may be due to assessors’ use of different management behavior schemata for different exercises; this would result in limited cross-situational consistency from exercise to exercise and would produce spuriously high correlations between different dimension ratings within the same exercise. This explanation is akin to a form of halo in which correlations between categories are higher in magnitude than expected (Berman & Kenny, 1976; Cooper, 1981). The halo effect has been well documented in the performance appraisal literature that dates back to the early 1900s. A large body of research has been aimed at identifying the sources of the halo effect and eliminating it. However, despite these attempts, the halo effect continues to prevail so that the category ratings made by one evaluator are correlated higher than reality (Berman & Kenny, 1976; Cooper, 1981). Furthermore, ratings made by different evaluators usually are not as highly correlated as the ratings made by a single evaluator because it is unlikely they are using the same performance-related schemas or prototypes, especially if they have different contextual information (Feldman, 1981). Therefore, it is plausible to hypothesize that the halo is affecting assessors’ ratings of candidate performance given that the typical design of many assessment centers requires assessors to make ratings across many different exercises and dimensions (Bray, Campbell, & Grant, 1974). The notion that the situational specificity effect may be due to some form of cognitive error, such as cognitive demands, or some form of cognitive heuristic, such as prototypes, has prompted researchers to study the effect of altering the rating process on the construct validity of assessment center dimension evaluations. Several studies have been conducted that could be viewed as tests of the effect of the rating process on the construct validity of assessment center dimension ratings (Adams, 1997; Harris, Becker, & Smith, 1993; Silverman, Dalessio, Woods, & Johnson, 1986). The study by Silverman et al. (1986) examined the effects of two alternative scoring methods on construct validity. One of the methods, the within-exercise approach, required assessors to complete ratings of the candidate on all relevant dimensions within each exercise. The assessors would proceed to make ratings for the next exercise after the dimensions within the previous exercise had been rated. The second method, the within-dimension approach, required assessors to rate the candidates on the same dimension across all exercises. The assessors would proceed to make ratings for the next dimension after the previous dimension had been rated for all relevant exercises. Silverman et al. reported a greater degree of cross-situational consistency for the within-dimension method. However, exercise factors remained salient in the within-dimension data. Harris et al. (1993) suggested that the results found by Silverman et al. (1986) may have been caused by an artifact. Specifically, the within-dimension procedure
involved having the assessors first arrive at an overall dimension rating, followed by a rating of the dimension in each of the exercises. Harris et al. were concerned that requiring assessors to first determine an overall rating may have artificially forced them to be more consistent in their dimension ratings across exercises. Harris et al. thus conducted a study in which they replicated Silverman et al.'s within-exercise method and designed a more precise test of the within-dimension method that involved having the assessors first rate each dimension across exercises and then rate each dimension overall. Contrary to Silverman et al., the results showed that the within-dimension method did not increase the cross-situational consistency of assessment center ratings. The authors suggested that, based on the results of their study, it remained unclear whether: (a) behavior really differs from situation to situation, (b) behavior is truly quite consistent across situations but assessors are unable to accurately rate behavior because of their own schema-based processing, or (c) some behaviors are unobserved due to variations in the opportunities candidates have to display dimension-related behaviors during different exercises. They further suggested that the failure of their data to support the hypothesis that a scoring method can improve cross-situational consistency in assessment centers indicates that other manipulations (or possibly other explanations) are needed. Complicating comparison of their results, each assessor observed a given candidate across all exercises in the Silverman et al. study, whereas each assessor observed only one fourth of the exercises for a given candidate in the Harris et al. study. Thus, in the Harris et al. study, assessor evaluations were based in part on descriptive reports provided by other assessors.

Most recently, Adams (1997) designed a study that represented an attempt to provide another explanation for the lack of cross-situational consistency in assessment center ratings by using a different analytic methodology than that used in either the Silverman et al. (1986) or Harris et al. (1993) studies. Adams hypothesized that the presence of exercise factors was due to halo effects given that in all previous studies the same raters rated more than one dimension in the same exercise. In Adams's study, assessors directly observed a given candidate on all exercises and rated candidates on all dimensions for each exercise; however, ratings from each assessor were used from either: (a) all the dimensions in a single exercise (what she termed a within-exercise rating process), or (b) a single dimension across exercises (what she termed a within-dimension rating process). Thus, to obtain dimension ratings, Adams discarded three fourths of each assessor's data such that, for a given candidate, each assessor only provided ratings for a single dimension across all exercises and no two assessors provided ratings for the same dimension. Exploratory factor analyses on each process showed that exercise factors were obtained in the within-exercise rating process data and dimension factors were obtained in the within-dimension rating process data.

Several design flaws weakened interpretation of the Adams (1997) data. First, the rating process was not manipulated; that is, assessors did not use different
rating processes for a given candidate—instead, rating data were systematically discarded to form within-exercise and within-dimension matrices of correlations. Lack of experimental manipulation in this case opens the results to third-variable interpretations. Second, three videotaped actors performing at low and high performance levels were used in varying orders to represent candidate performance in a simulated assessment center. Undergraduate students participating for course credit served as assessors. Use of simulated candidates (a) is not representative of how assessment centers are actually conducted in practice, (b) may have increased the variance in behavior relative to what one would encounter in an applied setting, and (c) may have led to the observed high correlations among the dimension ratings (average r = .84 for the within-exercise rating process and average r = .82 for the within-dimension rating process).

The present study was designed to avoid some of the critical design flaws of the preceding studies. First, we experimentally manipulated the rating process such that trained undergraduate assessors rated dimensions according to either a within-exercise or a within-dimension rating process (see Research Design and Procedures section). In contrast to the three previous studies cited earlier, assessors using the within-dimension rating process only made ratings on one dimension for a given candidate and assessors using the within-exercise rating process only observed candidate behavior in one exercise. Second, we used participants as candidates in a simulated assessment center instead of videotaped actors.

Our first hypothesis was that assessor ratings for the within-exercise rating process would exhibit the presence of exercise factors due to strong, positive relations between ratings of different dimensions evaluated within one exercise by one assessor. We expected that the situational specificity effect would be present using this rating process due to either the provision of an opportunity for the halo effect to occur between dimension ratings within the same exercise or the high cognitive demands placed on the raters because each rater would be making a total of four ratings (i.e., rating four dimensions within one exercise). Our second hypothesis was that assessor ratings for the within-dimension rating process would exhibit dimension factors due to strong, positive relations between ratings of the same dimension across all exercises. We expected that the situational specificity effect would be ameliorated using this rating process due to either the elimination of the opportunity for the halo effect to occur between dimension ratings within the same exercise or the low cognitive demands placed on the raters because each rater would be making a total of only two ratings (i.e., rating one dimension for each of two exercises).

METHOD

Participants
One hundred undergraduate students from a large urban university in the southwestern United States participated as candidates in an assessment center simulation. The participants received extra credit points toward psychology courses for their participation in the study (i.e., the "candidates" were not actually applying and being evaluated for the position of resident assistant). Although we did not collect demographic information on the candidates, the general student body at the university is extremely diverse, composed of approximately equal percentages of men and women. The population is approximately 56% White, 14% Asian or Pacific Islander, 13% Hispanic, 10% African American, and 7% international students. Fourteen undergraduate students participated both as confederates in running the assessment center and as assessors in making dimension ratings in later stages of the project. Six of the confederates–assessors were men and 8 were women; 9 were White, 2 were Asian or Pacific Islander, 1 was African American, 1 was of Portuguese origin, and 1 was of Middle Eastern origin.
Description of Assessment Center

An assessment center designed for the evaluation of resident hall assistant (RA) candidates was used. Gaugler and Thornton (1989) delineated the target job based on a thorough job analysis. The RA job was used because it was not a difficult job to understand and was potentially interesting to undergraduate students. Also, the RA job can be considered a managerial position, making it similar to most other jobs that utilize assessment centers as a selection device (Gaugler & Rudolph, 1992). Gaugler and Rudolph (1992) designed three exercises based on the job analysis data of the RA position. The study reported here used two of the three exercises: facilitation of a group meeting involving resident representatives and mediation between two roommates. They identified and defined four dimensions as necessary for successful performance in the RA position: oral communication, organizing and planning, analysis and judgment, and sensitivity. As Gaugler and Rudolph indicated, expert judges found the four dimensions distinguishable from each other in the three exercises. Definitions of each of the assessment center dimensions can be obtained from Chet Robie.
Research Design and Procedures

Data collection occurred in two contiguous semesters. In the first semester, the undergraduate students participated as candidates in a simulated assessment center. Each candidate's performance in the simulated assessment center was videotaped. Candidates were given 15 min to read a description of the assessment center and their role in that center. Specifically, in the first exercise, the candidates were required to hold a floor meeting to explain the campus alcohol policy to two of their subordinate resident floor leaders (confederates). In the second exercise, the candidates were required to mediate a conflict between two of their residents (confederates). In each exercise, same-gender undergraduate research assistants served as confederates and were instructed to treat each candidate in a similar manner. Each exercise took approximately 20 min to complete and the order of the exercises was randomized. Participants were allowed no more than 15 min before each exercise to study information about the scenario. Scripts modified from Gaugler and Rudolph (1992) were used to familiarize the candidates and confederates with the roles they were expected to play.

In the second semester, the same undergraduate research assistants who had run the candidates through the simulated assessment center and acted as confederates served as assessors to rate the dimensions. Prior to participating as an assessor, the research participants received a 3-day training session that explained assessment center methodology and the rating process, defined dimensions, and emphasized avoiding rating errors and maintaining focus on behaviors. The training program was consistent in content with the Guidelines and Ethical Considerations for Assessment Center Operations (Task Force on Assessment Center Guidelines, 1989). Assessor training included group practice and feedback sessions. Practice ratings were reviewed to ensure standardized performance. The training and assessment center protocols can be obtained from Chet Robie.

Assessors were instructed to view the videotapes without rewinding and rate according to one of two processes. In the first process, the assessors were required to view performance in only one exercise and then make dimension ratings on all four dimensions (i.e., within-exercise rating process). Thus, in the within-exercise process, two assessors rated the performance of a given ratee on all four dimensions, with one assessor using the first exercise as the basis of ratings and the other assessor using the second exercise as the basis of ratings. In the second process, the assessors were required to view performance in both exercises but make ratings on only one dimension for each exercise. Thus, in the within-dimension process, four assessors each rated the performance of a given ratee twice on a single dimension (a different dimension for each assessor), once using the first exercise as the basis of ratings and once using the second exercise as the basis of ratings. Assessors were instructed to randomly choose for each candidate whether they would rate the candidate using a within-dimension or within-exercise rating process. Assessors rated multiple ratees in both rating conditions and no systematic pairing of raters was implemented. Thirteen of the 14 assessors completed approximately equal numbers of ratings. One of the assessors did not complete a sufficient number of ratings (only 3 candidates); thus, his rating data were removed from analyses and other raters were instructed to rerate the candidates that he had rated. Assessors did not rate candidates if they had acted as a confederate for that candidate in the previous semester. Each assessor was allowed to complete ratings using one of the two rating processes only once for each candidate.
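To make the two assignment schemes concrete, the following sketch enumerates the (exercise, dimension) ratings contributed by each assessor for a single candidate under each process. It is an illustration only; the function and variable names are hypothetical and are not taken from the study materials.

```python
# Illustrative sketch only; names and structure are hypothetical, not the study's materials.
# It enumerates which (exercise, dimension) ratings each assessor contributes for one
# candidate under the two rating processes described above.

EXERCISES = ["floor_meeting", "conflict_mediation"]
DIMENSIONS = ["oral_communication", "organizing_planning", "analysis_judgment", "sensitivity"]


def within_exercise_assignment(assessors):
    """Two assessors per candidate: each watches one exercise and rates all four dimensions."""
    assert len(assessors) == 2
    return {
        assessor: [(exercise, dim) for dim in DIMENSIONS]
        for assessor, exercise in zip(assessors, EXERCISES)
    }


def within_dimension_assignment(assessors):
    """Four assessors per candidate: each watches both exercises but rates only one dimension."""
    assert len(assessors) == 4
    return {
        assessor: [(exercise, dim) for exercise in EXERCISES]
        for assessor, dim in zip(assessors, DIMENSIONS)
    }


if __name__ == "__main__":
    we = within_exercise_assignment(["assessor_A", "assessor_B"])
    wd = within_dimension_assignment(["assessor_C", "assessor_D", "assessor_E", "assessor_F"])
    # Each process yields the same 8 ratings per candidate (2 exercises x 4 dimensions),
    # but under the within-dimension process no assessor rates two dimensions in the same
    # exercise, and under the within-exercise process no assessor rates the same dimension
    # in two exercises.
    print(we)
    print(wd)
```

Either scheme produces the same eight ratings per candidate (2 exercises × 4 dimensions); what differs is whether a given assessor's judgments span dimensions within an exercise or span exercises within a dimension.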
As the assessors viewed each candidate, they were given the option of completing a behavioral checklist to assist them in keeping track of behaviors exhibited by each candidate in each exercise. Dimension ratings were made on a Likert-format scale ranging from 1 (very poor) to 5 (excellent). The behavioral checklist and dimension-rating sheet can be obtained from Chet Robie. Assessors were blind to the true purpose of the study.

RESULTS

Means, standard deviations, and intercorrelations for each rating process are given in Table 1. Inspection of this table clearly shows that the monotrait–heteromethod correlations are higher for the within-dimension rating process compared to the within-exercise rating process and that the heterotrait–monomethod correlations are higher for the within-exercise rating process compared to the within-dimension rating process. The mean difference in monotrait–heteromethod correlations between the two rating processes was .32 (ranging from .18 for analysis and judgment to .58 for oral communication).

TABLE 1
Means, Standard Deviations, and Intercorrelations for Each Rating Process

Within-exercise rating process
        M     SD    OC1   OP1   AJ1   SN1   OC2   OP2   AJ2   SN2
OC1   2.98  1.11     —
OP1   2.50  1.16   .38     —
AJ1   2.87  1.06   .51   .72     —
SN1   2.92  1.02   .49   .55   .56     —
OC2   2.93  1.15   .25   .41   .40   .37     —
OP2   2.35  1.05   .20   .50   .54   .29   .66     —
AJ2   2.82  1.22   .21   .52   .47   .35   .67   .79     —
SN2   2.98  1.08   .15   .43   .31   .35   .65   .55   .65     —

Within-dimension rating process
        M     SD    OC1   OP1   AJ1   SN1   OC2   OP2   AJ2   SN2
OC1   2.75  1.03     —
OP1   2.51  1.04   .31     —
AJ1   2.91   .96   .35   .40     —
SN1   2.92  1.04   .30   .19   .20     —
OC2   2.89   .99   .83   .30   .26   .27     —
OP2   2.48  1.03   .36   .75   .35   .19   .33     —
AJ2   3.03  1.04   .25   .28   .65   .17   .19   .34     —
SN2   2.89  1.13   .33   .25   .20   .62   .31   .29   .26     —

Note. N = 100 for each rating process. OC1 = oral communication (Exercise 1); OP1 = organizing and planning (Exercise 1); AJ1 = analysis and judgment (Exercise 1); SN1 = sensitivity (Exercise 1); OC2 = oral communication (Exercise 2); OP2 = organizing and planning (Exercise 2); AJ2 = analysis and judgment (Exercise 2); SN2 = sensitivity (Exercise 2). Exercise 1 = floor meeting; Exercise 2 = conflict mediation. Monotrait–heteromethod correlations are those between ratings of the same dimension across the two exercises (OC1–OC2, OP1–OP2, AJ1–AJ2, SN1–SN2).
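As an illustration only (not part of the original analyses), the summary reported above can be reproduced directly from the monotrait–heteromethod entries of Table 1; the values below are hard-coded from the table.

```python
# Sketch reproducing the monotrait-heteromethod summary reported above from the
# Table 1 correlations (values hard-coded from the table; illustration, not study code).
import numpy as np

dimensions = ["OC", "OP", "AJ", "SN"]

# Correlation between Exercise 1 and Exercise 2 ratings of the same dimension.
monotrait_within_exercise = np.array([0.25, 0.50, 0.47, 0.35])   # within-exercise process
monotrait_within_dimension = np.array([0.83, 0.75, 0.65, 0.62])  # within-dimension process

diff = monotrait_within_dimension - monotrait_within_exercise
for dim, d in zip(dimensions, diff):
    print(f"{dim}: difference = {d:.2f}")
print(f"mean difference = {diff.mean():.2f}")            # 0.32
print(f"range = {diff.min():.2f} to {diff.max():.2f}")   # 0.18 (AJ) to 0.58 (OC)
```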
Confirmatory factor analysis (CFA) tests of our hypotheses were conducted using Amos 3.6 (Arbuckle, 1997). An 8 × 8 covariance matrix for each rating process was calculated and used in the analyses. Several competing models were examined for each rating process, including a single general factor model, a two-exercise factor model, and a four-dimension factor model. A single-factor model was included to investigate whether a general, halo factor could best account for either rating process. We used four indicators of fit to assess the models tested, including the chi-square goodness-of-fit test (Jöreskog, 1977), the goodness of fit index (GFI; Tanaka & Huba, 1985), the comparative fit index (CFI; Bentler, 1990), and the root mean square error of approximation (RMSEA; Browne & Cudeck, 1993). Nonsignificant chi-square values and GFI and CFI values above .90 generally indicate a good fit of the model to the data. An RMSEA of .05 and lower has been suggested as an indicator of close fit, whereas .05 to .10 suggests a reasonable fit of the model to the data (Browne & Cudeck, 1993). We expected that the two-exercise factor model would fit the within-exercise rating process data the best and the four-dimension factor model would fit the within-dimension rating process data the best.

Results of the CFA tests are given in Table 2. Results strongly supported the hypotheses. The best fitting model for the within-exercise rating process was the two-exercise factor model. Although the chi-square value was statistically significant, the GFI and CFI were both above .90 and the RMSEA value was .10. Figure 1 displays the standardized estimates for this model. The factor loadings are routinely high and the correlation between exercise factors is also moderately high.

TABLE 2
Confirmatory Factor Analyses of Each Rating Process

Model Specification                   χ²      df    GFI    CFI   RMSEA
Within-exercise rating process
  Single general factor            118.89*    20    .74    .76    .22
  Two exercise factors              37.63*    19    .91    .96    .10
  Four dimension factors            91.00*    14    .79    .82    .24
Within-dimension rating process
  Single general factor            175.93*    20    .71    .54    .28
  Two exercise factors             160.04*    19    .75    .58    .27
  Four dimension factors            10.36     14    .98   1.00    .00

Note. For the chi-square values, N = 100. GFI = goodness of fit index; CFI = comparative fit index; RMSEA = root mean square error of approximation.
*p < .05.
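The CFA models themselves are not reproduced here (the original analyses were run in Amos 3.6). As a rough illustration, the sketch below writes the two competing measurement models in lavaan-style syntax using the variable names from Table 1, and checks the RMSEA column of Table 2 against the standard point-estimate formula RMSEA = sqrt(max(χ² − df, 0) / (df(N − 1))). The model strings and helper names are illustrative assumptions; the χ² and df values are taken from Table 2.

```python
# Sketch only: the original analyses were run in Amos 3.6. The model syntax strings are
# lavaan-style illustrations (not the authors' specification files); the chi-square and df
# values are copied from Table 2.
import numpy as np

TWO_EXERCISE_FACTOR_MODEL = """
Exercise1 =~ OC1 + OP1 + AJ1 + SN1
Exercise2 =~ OC2 + OP2 + AJ2 + SN2
"""

FOUR_DIMENSION_FACTOR_MODEL = """
OralComm =~ OC1 + OC2
OrgPlan  =~ OP1 + OP2
AnalJudg =~ AJ1 + AJ2
Sensitiv =~ SN1 + SN2
"""

def rmsea(chi2, df, n=100):
    """Point estimate of the root mean square error of approximation."""
    return np.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# (chi2, df) pairs from Table 2; the printed values match the reported RMSEA column.
table2 = {
    "WE single general factor":    (118.89, 20),  # -> .22
    "WE two exercise factors":     (37.63, 19),   # -> .10
    "WE four dimension factors":   (91.00, 14),   # -> .24
    "WD single general factor":    (175.93, 20),  # -> .28
    "WD two exercise factors":     (160.04, 19),  # -> .27
    "WD four dimension factors":   (10.36, 14),   # -> .00
}
for label, (chi2, df) in table2.items():
    print(f"{label}: RMSEA = {rmsea(chi2, df):.2f}")
```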
FIGURE 1 Two-factor standardized maximum likelihood solution for the within-exercise rating process model.
FIGURE 2 Four-factor standardized maximum likelihood solution for the within-dimension rating process model.
The best fitting model for the within-dimension rating process was the four-dimension factor model. The chi-square value was statistically nonsignificant, the CFI and GFI values were well above .90, and the RMSEA value was .00. Figure 2 displays the standardized estimates for this model. Again, the factor loadings are routinely high and the intercorrelations among the dimension factors are of moderate size (on average not as high as the correlation between the exercise factors in the previous model).

The pattern of factor loadings and factor intercorrelations in Figure 2 provides evidence of strong convergent and discriminant validity for the ratings made using the within-dimension rating process. The factor loadings relating the manifest indicators to their latent constructs are very large, whereas the correlations between the ostensibly distinct dimensions are moderate in size. Moreover, the pattern of correlations among the latent factors is sensible in that the sensitivity dimension is least strongly related to the analysis and judgment dimension (r = .30) and most strongly related to the oral communication dimension (r = .40), and the two dimensions with the strongest relation are the organizing and planning dimension and the analysis and judgment dimension (r = .48). This particular pattern makes sense in that skills in sensitivity probably also require a degree of skill in oral communication and skills in organizing and planning probably also require a degree of skill in analysis and judgment.
DISCUSSION

The finding of the exercise effect in assessment center ratings has long puzzled researchers and practitioners alike, who find it difficult to reconcile the apparent strong evidence of predictive utility of overall assessment center ratings (e.g., Gaugler et al., 1987) with the apparent lack of discriminant validity of dimension ratings (e.g., Sackett & Dreher, 1982). The results of this study may provide some answers to this puzzle. Specifically, we found exercise factors when the rating process was manipulated such that assessors rated all dimensions within a single exercise, and we found dimension factors when the rating process was manipulated such that assessors rated a single dimension across each exercise. Thus, the exercise effect found in previous construct validation efforts (Bycio et al., 1987; Sackett & Dreher, 1982; Turnage & Muchinsky, 1982) was eliminated when the same assessor did not rate all dimensions within one exercise.

Previous exploratory factor analyses indicated that a large number of dimension ratings could be best explained in terms of just three factors: leadership, organizing and planning, and decision making (Sackett & Hakel, 1979; Schmitt, 1977). This was contrary to the researchers' expectations of factors mirroring the dimensions. Therefore, explanations were offered that rested on the idea that the assessment process placed heavy cognitive demands on assessors, causing them to make global evaluations as opposed to dimension-based evaluations (Gaugler & Thornton, 1989; Russell, 1985; Sackett & Hakel, 1979).
For instance, having assessors evaluate each dimension after each exercise may reduce the cognitive demands placed on assessors. This rating process, known as the within-exercise rating method, was empirically investigated using factor analysis (Sackett & Dreher, 1982). Overall, the results indicated that the underlying factors represented the exercises as opposed to the individual dimensions. Thus, assessors' ratings were best explained by the candidates' performance on the exercises as opposed to the dimensions. An explanation of the exercise factors other than cognitive demands rested on the idea that the within-exercise rating method may contaminate the dimension ratings through the presence of halo error (Sackett & Dreher, 1982; Silverman et al., 1986; Turnage & Muchinsky, 1982).

The design of the study precluded a highly definitive answer to the question of whether cognitive load or halo error is the causal factor for this study's results. Specifically, the within-exercise rating process required assessors to make four ratings (four ratings for all of the dimensions in one exercise), whereas the within-dimension rating process required assessors to make two ratings (two ratings for the same dimension across both exercises). The reduction of cognitive load in the within-dimension rating process could have resulted in greater discriminant validity of the dimension ratings compared to the within-exercise rating process. However, given that Gaugler and Thornton (1989) found that reduction of cognitive load increased rating accuracy but did not increase discriminant validity of the ratings, we believe this interpretation to be weak in comparison to a halo interpretation. A future study could more definitively answer this question if cognitive load were somehow controlled across the rating process methods. Relatedly, increasing the number of exercises and dimensions in future studies to the number found in actual assessment centers would strengthen arguments for generalization of these findings to field settings. For example, it is possible that the situational specificity effect would be found across both rating process methods (even when cognitive demands are controlled for across methods) if a large number of ratings have to be made.

It is also important to note that this study does not provide evidence that assessment center dimension ratings can be made construct valid through the use of the rating process manipulation used in the study. Instead, the study provides evidence to suggest that the rating process introduces a methodological artifact that may affect construct validity evidence. Additional research that addresses the effects of other methodological artifacts on construct validity evidence is warranted. Also, future construct validation investigations of assessment center ratings would need to control for these artifacts to make a contribution to the current debate.

The primary practical implication of the results of this study is that employing as many assessors as there are dimensions for each candidate's data may increase the discriminant validity of assessment center dimension ratings. However, this may prove to be overly expensive and restrictive for many users of assessment
centers who may only have one or, at most, two assessors rate each candidate. It should be noted, however, that the use of videotape does not seem to reduce the observation or evaluation accuracy of assessment center ratings compared to direct observation (Ryan et al., 1995). Assessment center performance in a field setting can be videotaped, as was done in this study, and multiple trained assessors can view the videotape at their convenience. This would reduce the logistical problems involved in requiring all assessors to be present at the time the candidate performs the assessment center exercises. Thus, an organization using assessment centers may have to trade off the increased cost of employing multiple assessors with the possible decreased discriminant validity of their dimension ratings. Perhaps future research could investigate the possibility that the exercise effect can be lessened by a slightly modified version of the within-dimension rating process used in this study in which an assessor rates only a certain proportion of dimensions across multiple exercises. The use of videotape methodology would be conducive to such research efforts in that multiple rating processes could be examined in terms of their effects on construct validity without the need to run additional candidates through an assessment center. However, it also should be noted that when live observation is necessary and desirable (e.g., some users of assessment centers may not want their candidates to be videotaped for legal reasons or because they fear the candidates will react negatively or the videotaping itself will otherwise affect performance), it may be difficult if not impossible to implement a rating process design such as the one used in this study. It was expected that the use of undergraduate students as candidates would pose no threat to external validity because of the high fidelity of the simulation, the students’ familiarity with the job of RA, and the degree to which the job of RA is similar in many respects to many lower level managerial jobs. Several published studies have used this simulation to investigate issues related to the assessment center (Gaugler & Rudolph, 1992; Gaugler & Thornton, 1989). Furthermore, the undergraduate research assistants who served as assessors in our study were given 3 days of assessor training that was consistent with accepted assessment center guidelines (Task Force on Assessment Center Guidelines, 1989). The average length of assessment center training is 2.48 days (Spychalski, Quiñones, Gaugler, & Pohley, 1997). Several studies have provided evidence that behavioral checklists have the potential to increase the convergent and discriminant validity of assessment center ratings (Donahue, Truxillo, Cornwell, & Gerrity, 1997; Reilly et al., 1990). Assessors were given the option of using a behavioral checklist in this study. Moreover, we did not keep track of which assessors did or did not use the checklists for their ratings. The inclusion of the behavioral checklist in our study is thus a potential confounding variable. It is likely, however, that variance in the ratings due to the use of the behavioral checklist acted as random error variance because those assessors who did utilize the behavioral checklist in all likelihood used the checklist for
both rating methods. Future research utilizing behavioral checklists should consider instituting them on a more systematic basis.

Future research should also investigate the predictive validity of dimension ratings made using the different rating processes in relation to job-relevant outcomes. Although this study showed an increase in discriminant validity with the use of a within-dimension versus a within-exercise rating process, it is unknown to what degree the increase in discriminant validity is matched by an increase in predictive utility for important, job-related outcomes. The performance appraisal literature on halo effects has found that they are weakly related to measures of rating accuracy (Kasten & Weintraub, 1999; Murphy & Balzer, 1989) and that the halo may actually act to increase criterion-related validities (Nathan & Tippins, 1990). Murphy and Cleveland (1995) suggested that these findings were due to the relation between the halo and reliability; higher halo generally acts to increase the accuracy of discriminating among ratees, across dimensions (i.e., differential elevation; Cronbach, 1955; Murphy, Gracia, Kerkar, Martin, & Balzer, 1982). It would stand to reason then that an overall assessment center rating saturated with halo would probably be a better predictor of job performance than a unit-weighted sum of the dimensions when using a within-dimension rating process. However, a high degree of halo may act to reduce the degree of discrimination raters make across dimensions for a given ratee; this may lead to inaccuracy in detecting ratee differences in patterns of performance (i.e., accuracy in diagnosing individual strengths and weaknesses or differential accuracy; Cronbach, 1955; Murphy et al., 1982). Murphy and Cleveland (1995) suggested that differential accuracy is important to increase when different decisions might be made, depending on the pattern of performance of a candidate across performance dimensions. Given that assessment centers are increasingly being used for developmental purposes (see Spychalski et al., 1997, p. 78), methods to increase the accurate identification of a candidate's strengths and weaknesses across assessment center dimensions will be increasingly important. Thus, future research should also investigate whether the within-dimension rating process does indeed increase differential accuracy and subsequently whether the feedback is found to be more valuable in helping to change behavior.

ACKNOWLEDGMENTS

Chet Robie is now at Personnel Decisions International, Minneapolis, MN. Kimberly A. Adams is now at the American Institutes for Research, Washington, DC. We thank Amy Carver and Barbara Gaugler for providing us with the experimental materials, and Ryan Bullock, Tim Chen, Jennifer Cronquist, Alexis Dennis, Audalio Desouza, Cathleen Fanelli, John Keck, Ruwa Hijazi, Sung Kwan, Michael Mandzuk, Stacy Mitchell, Mary Anne Reinke, Dwight Robinson, and
Kristen Watrous for helping to collect the data. We also thank Jane Chan for her help with literature searches and Edward Pavur and Ann Marie Ryan for their helpful comments on previous versions of this article.

REFERENCES

Adams, K. A. (1997). The effect of the rating process on construct validity: Reexamination of the exercise effect in assessment center ratings. Unpublished master's thesis, University of Houston, Houston, TX.
Arbuckle, J. L. (1997). Amos user's guide (Version 3.6). Chicago: Smallwaters.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.
Berman, J. S., & Kenny, D. A. (1976). Correlational bias in observer ratings. Journal of Personality and Social Psychology, 34, 263–273.
Bray, D. W., Campbell, R. J., & Grant, D. L. (1974). Formative years in business: A long-term AT&T study of managerial lives. New York: Wiley.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
Bycio, P., Alvares, K. M., & Hahn, J. (1987). Situational specificity in assessment center ratings: A confirmatory factor analysis. Journal of Applied Psychology, 72, 463–474.
Cooper, W. H. (1981). Ubiquitous halo. Psychological Bulletin, 90, 218–244.
Cronbach, L. J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 59, 177–193.
Donahue, L. M., Truxillo, D. M., Cornwell, J. M., & Gerrity, M. J. (1997). Assessment center construct validity and behavioral checklists: Some additional findings. Journal of Social Behavior and Personality, 12, 85–108.
Feldman, J. M. (1981). Beyond attribution theory: Cognitive processes in performance appraisal. Journal of Applied Psychology, 66, 127–148.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., III, & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72, 493–511.
Gaugler, B. B., & Rudolph, A. S. (1992). The influence of assessee performance variation on assessors' judgments. Personnel Psychology, 45, 77–98.
Gaugler, B. B., & Thornton, G. C., III. (1989). Number of assessment center dimensions as a determinant of assessor accuracy. Journal of Applied Psychology, 74, 611–618.
Harris, M. H., Becker, A. S., & Smith, D. E. (1993). Does the assessment center scoring method affect the cross-situational consistency of ratings? Journal of Applied Psychology, 78, 675–678.
Jöreskog, K. G. (1977). Structural equation models in the social sciences: Specification, estimation, and testing. In P. R. Krishnaiah (Ed.), Applications of statistics (pp. 265–287). Amsterdam: North-Holland.
Kasten, R., & Weintraub, Z. (1999). Rating errors and rating accuracy: A field experiment. Human Performance, 12, 137–153.
Murphy, K. R., & Balzer, W. K. (1989). Rating errors and rating accuracy. Journal of Applied Psychology, 74, 619–624.
Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: Sage.
Murphy, K. R., Gracia, M., Kerkar, S., Martin, C., & Balzer, W. K. (1982). Relationship between observational accuracy and accuracy in evaluating performance. Journal of Applied Psychology, 67, 320–325.
Nathan, B. R., & Tippins, N. (1990). The consequences of halo "error" in performance ratings: A field study of the moderating effect of halo on test validation results. Journal of Applied Psychology, 75, 290–296.
Norton, S. D. (1977). The empirical and content validity of assessment centers vs. traditional methods of predicting managerial success. Academy of Management Review, 2, 442–453.
Reilly, R. R., Henry, S., & Smither, J. W. (1990). An examination of the effects of using behavior checklists on the construct validity of assessment center dimensions. Personnel Psychology, 43, 71–84.
Robertson, I., Gratton, L., & Sharpley, D. (1987). The psychometric properties and design of managerial assessment centres: Dimensions into exercises won't go. Journal of Occupational Psychology, 60, 187–195.
Russell, C. J. (1985). Individual decision processes in an assessment center. Journal of Applied Psychology, 70, 737–746.
Ryan, A. M., Daum, D., Bauman, T., Grisez, M., Mattimore, K., Nalodka, T., & McCormick, S. (1995). Direct, indirect, and controlled observation and rating accuracy. Journal of Applied Psychology, 80, 664–670.
Sackett, P. R. (1987). Assessment centers and content validity: Some neglected issues. Personnel Psychology, 40, 13–25.
Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67, 401–410.
Sackett, P. R., & Hakel, M. D. (1979). Temporal stability and individual differences in using assessment information to form overall ratings. Organizational Behavior and Human Performance, 23, 120–137.
Sackett, P. R., & Harris, M. (1988). A further examination of the constructs underlying assessment center ratings. Journal of Business and Psychology, 3, 214–229.
Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. Annual Review of Psychology, 43, 627–670.
Schmitt, N. (1977). Interrater agreement in dimensionality and combination of assessment center judgments. Journal of Applied Psychology, 62, 171–176.
Silverman, W. H., Dalessio, A., Woods, S. B., & Johnson, R. L., Jr. (1986). Influence of assessment center methods on assessors' ratings. Personnel Psychology, 39, 565–578.
Spychalski, A. C., Quiñones, M. A., Gaugler, B. B., & Pohley, K. (1997). A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50, 71–90.
Tanaka, J. S., & Huba, G. J. (1985). A fit index for covariance structure models under arbitrary GLS estimation. British Journal of Mathematical and Statistical Psychology, 38, 197–201.
Task Force on Assessment Center Guidelines. (1989). Guidelines and ethical considerations for assessment center operations. Public Personnel Management, 18, 457–470.
Thornton, G. C. (1992). Assessment centers and managerial performance. Reading, MA: Addison-Wesley.
Thornton, G. C., & Byham, W. C. (1982). Assessment centers and managerial performance. San Diego, CA: Academic.
Turnage, J. J., & Muchinsky, P. M. (1982). Transsituational variability in human performance within assessment centers. Organizational Behavior and Human Performance, 30, 174–200.
Zedeck, S. (1986). A process analysis of the assessment center method. In B. M. Staw & L. L. Cummings (Eds.), Research in organizational behavior (Vol. 8, pp. 259–296). Greenwich, CT: JAI.