Journal of Applied Psychology 2002, Vol. 87, No. 4, 735–746

Copyright 2002 by the American Psychological Association, Inc. 0021-9010/02/$5.00 DOI: 10.1037//0021-9010.87.4.735


A New Frame for Frame-of-Reference Training: Enhancing the Construct Validity of Assessment Centers

Deidra J. Schleicher
University of Tulsa

David V. Day
The Pennsylvania State University

Bronston T. Mayes
California State University, Fullerton

Ronald E. Riggio
Claremont McKenna College

The authors undertook a comprehensive examination of the construct validity of an assessment center in this study by (a) gathering many different types of evidence to evaluate the strength of the inference between predictor measures and constructs (e.g., reliability, accuracy, convergent and discriminant relationships), (b) introducing a theoretically relevant intervention (frame-of-reference [FOR] training) aimed at improving construct validity, and (c) examining the effect of this intervention on criterion-related validity (something heretofore unexamined in the assessment center literature). Results from 58 assessees and 122 assessors suggested that FOR training was effective at improving the reliability, accuracy, convergent and discriminant validity, and criterion-related validity of assessment center ratings. Findings are discussed in terms of implications and future directions for both FOR training and assessment center practice.

Author Note

Deidra J. Schleicher, Department of Psychology, University of Tulsa; David V. Day, Department of Psychology, The Pennsylvania State University; Bronston T. Mayes, Department of Management, California State University, Fullerton; Ronald E. Riggio, Kravis Leadership Institute, Claremont McKenna College.

Portions of this article were presented at the 14th Annual Conference of the Society for Industrial and Organizational Psychology, May 1999, Atlanta, Georgia. This study is based on a doctoral dissertation completed by Deidra J. Schleicher at The Pennsylvania State University. We gratefully acknowledge the helpful guidance and insight provided by James Farr, Martin Kilduff, and Melvin Mark, as well as the invaluable assistance of Bill Culbertson, Jackie Flynn, Tria Pavlides, Tim Semen, and Melissa Wyant.

Correspondence concerning this article should be addressed to Deidra J. Schleicher, Department of Psychology, 308 Lorton Hall, University of Tulsa, Tulsa, Oklahoma 74104-3189. E-mail: deidra-schleicher@utulsa.edu

From its initial use for military officer selection in the 1940s, the assessment center (AC) has become a popular technique for assessing managerial potential. Debates over the validity of this method are also popular. After a brief review of the evidence for AC validity, the present study examines the efficacy of a specific intervention for improving the validity of ACs.

AC Validity

The most impressive and consistent evidence for the validity of ACs comes in the form of criterion-related validity. From Bray and Grant's (1966) early validation of AT&T's Management Progress Study to more recent meta-analyses (Gaugler, Rosenthal, Thornton, & Bentson, 1987; Hunter & Hunter, 1984; Schmitt, Gooding, Noe, & Kirsch, 1984), there have been consistent findings of moderate to high validity coefficients for ACs in predicting managerial success. The evidence for the construct validity of ACs, however, has been less clear. Although researchers have often found evidence of convergent validity (in the form of a general trait or person factor), it has been accompanied by a corresponding lack of discriminant validity (Archambeau, 1979; Neidig, Martin, & Yates, 1979; Sackett & Dreher, 1982; Shore, Thornton, & McFarlane Shore, 1990; Turnage & Muchinsky, 1982). This robust finding of method variance (referred to as the exercise effect) has led many researchers to question the construct validity of assessment ratings. Interventions such as reducing the number of dimensions to be rated (Bycio, Alvares, & Hahn, 1987), using behavioral checklists (Reilly, Henry, & Smither, 1990), and varying the rating instructions (Silverman, Dalessio, Woods, & Johnson, 1986) have been proposed in an attempt to improve the construct validity of ACs. Although such interventions have shown some success, they are lacking in several regards. First, these investigations generally have looked solely at the effect of interventions on convergent and discriminant relationships within AC ratings (i.e., mono- and heterotrait and method correlations). However, recent conceptualizations of validity have invoked a broader approach to establishing validity, and such convergent and discriminant relationships represent only one type of evidence for construct validity. Second, previous investigations have ignored theoretically relevant interventions from other literatures. Specifically, the role of assessor training in improving the construct validity of ratings has been ignored, despite the theoretical and empirically documented importance of training for improving the quality of judgments. Finally, as noted by Klimoski and Brickner (1987), previous studies examining variables thought to improve the construct validity of assessment ratings have not examined the effect on criterion-related validities, even though criterion-related validity is an important aspect of construct validity (Landy, 1986).


Others have noted the possibility that criterion-related validities would improve if abilities were better measured within exercises (e.g., Bycio et al., 1987; Sackett, 1987; Silverman et al., 1986) and have encouraged such examinations. The fact that the impact of an improvement in construct validity on the criterion-related validity of ACs has not been examined seems to be a particularly large oversight given that it is the predictive validity by which the effectiveness of ACs is most likely to be judged. In the present study, we attempted a more comprehensive examination of the construct validity of an AC by (a) widening the lens for viewing evidence of construct validity, (b) introducing a theoretically relevant training intervention aimed at improving construct validity, and (c) examining the impact of this intervention on criterion-related validity as well as construct validity.

Assessor Training

As set forth by the guidelines on ACs (Task Force on Assessment Center Guidelines, 1989; Task Force on Assessment Center Standards, 1980), assessor training is an integral component of the AC program. The potential for increasing the reliability and the construct validity of ratings through more careful attention to assessor training has been emphasized by many authors (e.g., Bray & Grant, 1966; Norton, 1981; Silverman et al., 1986; Spychalski, Quinones, Gaugler, & Pohley, 1997; Thomson, 1970; Turnage & Muchinsky, 1982). Training has also been specifically singled out as an important determinant of the strength of exercise factors (Schmitt, Schneider, & Cohen, 1990; Turnage & Muchinsky, 1982). Despite all the suppositions, however, there is no empirical research on the effectiveness of various assessor training strategies for improving AC validity. Even the guidelines on ACs (Task Force on Assessment Center Guidelines, 1989; Task Force on Assessment Center Standards, 1980) are vague on this point, simply noting that "whatever the approach" (Task Force on Assessment Center Guidelines, 1989, p. 465) to assessor training, the objective is obtaining accurate judgments.

This emphasis on accuracy, especially considered in conjunction with Harris, Becker, and Smith's (1993) comment that future AC research may benefit by considering work in the performance appraisal area, suggests a specific assessor training strategy that should be effective for improving the construct validity of AC ratings. Frame-of-reference (FOR) training is an intervention designed to improve the validity of trait judgments in the performance appraisal context. It is theoretically relevant to an investigation of the construct validity of AC ratings, as described below.

Overview of FOR Training

FOR training was initially proposed by Bernardin and Buckley (1981); its primary goal was to eliminate idiosyncratic standards held by raters and replace them with a common frame of reference for rating. The overall intent of the training was to bring individual perceptions into closer congruence with those held more widely in the organization. The steps involved in such training include (a) teaching important dimensions comprising the job and behaviors indicative of each dimension, (b) discussing behaviors indicative of various effectiveness levels within each dimension, (c) providing practice evaluations with the new frame of reference, and (d) giving feedback on the accuracy of the ratings (Pulakos, 1984).

There is abundant and consistent evidence that FOR training increases the accuracy of performance appraisal ratings (e.g., Athey & McIntyre, 1987; Cardy & Keefe, 1994; Day & Sulsky, 1995; McIntyre, Smith, & Hassett, 1984; Pulakos, 1984, 1986; Schleicher & Day, 1998; Stamoulis & Hauenstein, 1993; Sulsky & Day, 1992, 1994; Woehr, 1994). A meta-analysis evaluating the relationship between performance appraisal training and rating outcomes (e.g., halo, leniency, observational and rating accuracy) concluded that FOR training results in the largest overall increase in rating accuracy (Woehr & Huffcutt, 1994). The theoretical relevance of FOR training for improving the construct validity of AC ratings is highlighted in the following section by comparing explanations for the exercise effect with known outcomes and advantages of FOR training.

FOR Training for Assessors

Most explanations found in the literature for the exercise effect invoke the limited information-processing capacity of assessors. To make their ratings, assessors must observe and recall behaviors for each assessee in each exercise and then categorize these behaviors into relevant dimensions. This is a demanding task, and it may be unrealistic to expect assessors to do all of this accurately and reliably, given the basic limitations in human information processing (Bycio et al., 1987; Reilly et al., 1990). These limitations are often exacerbated by the lack of clarity in the dimension definitions. That is, a great deal of the exercise effect may be caused by scoring the same behavior for several abilities (i.e., dimensional dependency; Brannick, Michaels, & Baker, 1989) or interpreting the dimension differently from one exercise to another (Robertson, Gratton, & Sharpley, 1987). Reilly et al. observed that definitions of dimensions are typically written in general terms and are not clearly related to the operational definitions (i.e., behaviors elicited by particular exercises). It is then left to the already cognitively overburdened assessors to decide which behaviors are part of which dimensions.

Another explanation for the exercise effect involves the use of categorization schemas. Shore et al. (1990) noted that schemas greatly simplify the challenging task required of assessors. That is, the ability to organize performance information around relevant categories should make the rating task both easier and more accurate. However, the AC may provide such a strong organizational framework in terms of exercises that assessors are forced to code the behavior by exercise rather than by dimension (Silverman et al., 1986), resulting in the exercise effect. Strong manipulations may be necessary to modify these schemas and therefore allow the assessors to successfully reorganize the information they have obtained by dimension.

The previously discussed explanations for the exercise effect seem to converge in suggesting an intervention that would (a) reduce information-processing demands placed on assessors, (b) provide greater clarity to the dimension definitions, and (c) change categorization schemas to be more dimension oriented. These are all established outcomes of FOR training. First, the central role of dimensions in FOR training is readily apparent. Most relevant to the present study is the fact that a large part of FOR training involves a discussion of the dimensions and the behaviors indicative of each dimension. Obviously, such an emphasis should help assessors distinguish between dimensions and thereby help to reduce the exercise effect. Moreover, assessors also learn how to apply consistent standards across situations (i.e., exercises). This strategy is believed to reduce the cognitive demands placed on assessors.


Furthermore, previous research on FOR training (Sulsky & Day, 1992) has shown that such training teaches raters to form more accurate on-line impressions, reducing the need to recall behaviors for each assessee in each exercise and then to categorize these behaviors into relevant dimensions. With the cognitive demands reduced, FOR-trained assessors should provide more accurate and reliable dimension ratings. A final point relevant to the exercise effect is that previous research has shown that FOR training changes categorization schemas to more trait-based representational forms (Schleicher & Day, 1998). Therefore, FOR training should lead to assessors organizing candidate information around dimension categories, as opposed to exercise categories, thereby improving the construct validity of the assessment ratings.

FOR Training and the Construct Validation of ACs

As Landy (1986) noted, all validation efforts should be construed as part of the broader process of scientific theory development (i.e., traditional hypothesis testing)—a process of accumulating various types of judgmental and empirical evidence to support an inferential network of psychological constructs and their operational measures (Binning & Barrett, 1989). Accordingly, any information gathered in the process of developing or using a test is relevant to its validity in that it contributes to users' understanding of the underlying construct (Anastasi, 1986). In that spirit, the present study examines many different sources of evidence for the construct validity of ACs.

In the context of theory building for the construct validity of ACs, FOR training of assessors can be thought of as an intervention that impacts the strength of two inferences: (a) the link between the predictor measure and the predictor construct (i.e., what is traditionally thought of as construct validity) and (b) the link between the predictor measure and the criterion measure (i.e., what is traditionally thought of as criterion-related validity). It is believed that FOR training will improve the overall quality of the predictor measure (i.e., assessment ratings) and therefore the strength of both of these inferences. In other words, training serves as a moderator of both construct validity (e.g., convergent–discriminant) relationships and criterion-related relationships. The specific components involved in testing these two inferences are described next.

Inference 1: Predictor Measure → Predictor Construct

This inference pertains to the level of confidence a researcher has that a predictor measures a specific construct of interest. Such evidence usually takes the form of empirically based relationships and judgments that are both convergent and discriminant in nature (Campbell & Fiske, 1959; Cook & Campbell, 1979; Cronbach & Meehl, 1955); however, examining the reliability of a predictor measure (e.g., AC ratings) can also provide construct-oriented evidence regarding how well components (e.g., different assessors' ratings) model the underlying predictor construct (Anastasi, 1986; Binning & Barrett, 1989). Therefore, we examined the impact of FOR training on improving the strength of this inference with regard to both reliability and convergent and discriminant relationships. We hypothesized that, because the goal of FOR training is to teach all assessors to view performance with a common frame of reference, the interrater reliability among FOR-trained assessors would be higher than the interrater reliability among control-trained assessors (Hypothesis 1).


As noted above, the strength of Inference 1 is also shown by patterns of convergent and discriminant correlations between the predictor measure and other measure of similar and different constructs. Although FOR training is generally expected to improve both the convergence and the discrimination of assessment ratings, this does not necessarily translate into straightforward hypotheses with regard to the pattern of mono- or hetero- trait and of method correlations. That is, FOR training is not expected to increase the monotrait– heteromethod correlations as an index of convergent validity. As Sackett (1987) noted, “Ratings of dimensions across exercises aren’t intended as merely repeated measures that should correlate perfectly well were it not for measurement error; rather, they are intended to tap only partially overlapping subsets of the behavioral domain that the dimension label represents” (p. 19). Therefore, FOR training, by improving the quality of assessment ratings, would not necessarily increase these monotrait– heteromethod correlations. Moreover, this type of convergent validity has often been found in previous research, even with untrained assessors (e.g., Neidig et al., 1979; Sackett & Dreher, 1982; Turnage & Muchinsky, 1982). In fact, it is possible that these correlations might be higher for control raters, given that they would be more likely to use broader categories (i.e., more general trait or person factors) for rating, which would contribute to the illusion of convergent validity. Therefore, nothing is hypothesized with regard to the size of monotrait– heteromethod correlations. What FOR training is expected to do, however, is to improve discriminant validity by clarifying the differences between assessment dimensions. We therefore expected to see lower correlations within exercises across different dimensions (heterotrait–monomethod correlations; Hypothesis 2a) and lower correlations between different dimensions in different exercises (heterotrait– heteromethod correlations; Hypothesis 2b) for FOR assessors than for control assessors. What has been described above represents only one way of assessing the convergent and discriminant validity of AC ratings, however. Several researchers have noted that examining relationships across and within exercises may be lacking as the sole strategy for judging claims of the construct validity of ACs. Perhaps a more appropriate standard would involve other methods of assessing similar constructs (e.g., Fleenor, 1996; Riggio, Aguirre, Mayes, Belloli, & Kubiak, 1997; Shore et al., 1990). Reilly et al. (1990) noted that the convergent and discriminant validity of assessment ratings should be judged on a multitrait–multimethod matrix that includes total scores on the AC dimensions correlated with other measures of those dimensions obtained outside of the AC exercises (e.g., personality and cognitive ability measures). These prescriptions suggest more of a nomological network (Cronbach & Meehl, 1955) approach to establishing the construct validity of assessment ratings. The present study adopts this more inclusive view of convergent and discriminant validity and examines relationships between dimension ratings and other personality and skills measures that assess both similar and different constructs. 
The personality and skills measures that we examined are those that were collected at the same time the assessees participated in the AC and include the need for dominance (Steers & Braunstein, 1976), self-monitoring (Snyder & Gangestad, 1986), and social skills (Riggio, 1989). We expected these measures to converge with ratings of the three dimensions assessed in the present AC (leadership, communication skills, and decision-making skills).


Specifically, self-monitoring, social skills, and the need for dominance should converge with the leadership dimension. High self-monitors tend to have generally outgoing and extraverted social styles that lead them to initiate conversations in interpersonal situations (Ickes & Barnes, 1977), as would those individuals high in social skills. This style would also lend itself to proportionally more contributions to group discussions, which have been shown to be predictive of leader emergence in groups (Stein & Heller, 1979). The need for dominance assesses individuals' attempts to control their environments and to influence others and the extent to which they enjoy the role of leader (Steers & Braunstein, 1976); such factors should obviously converge with AC leadership ratings. Self-monitoring and social skills should also converge with the communication skills dimension, given the importance of such skills in effective interpersonal communication. Therefore, because of these expected convergent relationships between personality and skills measures and the assessment dimensions, we expected that the dimension ratings of FOR-trained assessors, compared with the control-trained assessors, would demonstrate higher correlations with similar personality measures (Hypotheses 3a, 3b, 3c, 4a, and 4b).

An additional way to examine the convergent validity of the assessment ratings is to compare them with ratings on the same dimensions made by others at a different time in a different context. In the present study, this second source of dimension ratings is provided by the assessees' supervisors, and we hypothesized that the ratings of FOR-trained assessors, as compared with those of control-trained assessors, would converge better with the dimensional ratings provided by the supervisors (Hypothesis 5).

Accuracy is the final criterion used to evaluate the effectiveness of FOR training at strengthening Inference 1. We hypothesized, on the basis of extensive research on FOR training, that FOR-trained assessors would provide more accurate ratings than control-trained assessors (Hypothesis 6).

Inference 2: Predictor Measure → Criterion Measure

FOR training is believed also to play a moderating role in the predictor measure–criterion measure link, thereby affecting the criterion-related validity of AC ratings. In general, we expected that the ratings produced by FOR-trained raters—because of their better reliability, accuracy, and convergent–discriminant validity—would show stronger relationships with job performance than will those produced by control-trained raters (Hypothesis 7).

Method

Participants

Assessees. Approximately 500 former business students who participated in an undergraduate AC were recruited. All assessees completed personality and skills measures in the AC, which was conducted approximately 4 years prior to the collection of the criterion data. All AC performance was videotaped. Assessees were contacted via mailed surveys to collect the criterion information. If assessees failed to return the surveys, a follow-up phone call was made. Seventy-five respondents returned a survey (20% of total valid surveys sent; 105 surveys were returned by the U.S. Postal Service as undeliverable); complete data were available for 58 of the respondents. The breakdown of the sample was 54% male, 46% female; 50% White, 25% Asian, 13% Hispanic, 9% Filipino, 2% American Indian, and 2% Middle Eastern; mean age was 29.0 years (SD = 7.7); mean tenure was 32.0 months (SD = 26.3); 41% worked in the business sector (e.g., accounting, banking, investments, real estate), 25% worked in the service industry, 9% were in the manufacturing industry, and another 9% were in the transportation, communication, or public utilities sectors. Although the number of participants is consistent with typical validation sample sizes in industry (Schmidt, Hunter, & Urry, 1976), the low but not atypical survey return rate (Edwards, Thomas, Rosenfeld, & Booth-Kewley, 1997) prompted us to explore potential response biases. Differences were examined between respondents and nonrespondents on all feasible demographic and personality variables. There were no differences in terms of age, gender, ethnic group membership, or any of the personality variables (all ps > .05). Taken together, these comparisons suggest little reason to be concerned that response bias was a significant factor in the present study.

Assessors. Students at a different state university than the assessees (n = 122) served as assessors and viewed the videotaped AC performance of the 75 assessees who had returned the survey. Sixty-seven (55%) of these participants were male, and 98% of them were psychology students (with 3 master of business administration students). The mean age of the assessor participants was 22.4 years (SD = 0.5). Twenty-nine of the participants had managerial experience, 37 had conducted performance appraisals as part of their jobs, and 6 had participated in an AC. These assessors participated in the present study in groups of 4 to 6. Each group (22 total) was randomly assigned to a training condition (FOR or control; n = 11 groups each) and observed and rated the performance of 5 to 6 assessees.

AC Exercises

Three commonly used AC exercises were included in the present study: (a) leaderless group discussion (LGD), (b) presentation exercise, and (c) mock hiring interview. Each exercise was designed to provide an opportunity to rate three dimensions: (a) decision-making skills (including problem solving, analytical and decision-making ability, and creativity), (b) communication skills (including clarity of communication, expressiveness, and language and listening skills), and (c) leadership (including persuasiveness, influence on others, and human relations abilities). For control purposes and to keep the rating task manageable, only the first 5 min of both the LGD and the interview were observed and rated by the assessors.

Development of Rating Scales and Comparison Scores

Seven-point behaviorally anchored rating scales (BARS) were used to rate performance in each exercise on each of the three dimensions. Similar to the approach taken by Reilly et al. (1990), the first step in the development of these scales was to identify behaviors relevant to the three dimensions that are elicited by specific exercises. A group of five undergraduate research assistants trained on the meaning of the three dimensions viewed the videotapes of all 500 assessees and recorded critical incident behaviors that were indicative of exceptionally good or poor performance. Procedures outlined by Smith and Kendall (1963) were then followed for selecting behavioral anchors for the BARS. Specifically, the critical incidents were retranslated into the three dimensions by the five research assistants and the primary investigator, Deidra J. Schleicher. A criterion of 83% agreement (five out of six) was used to select behaviors for each dimension. The research team then assigned effectiveness ratings to these behaviors. On the basis of the means and the standard deviations of these effectiveness ratings, behavioral anchors for the BARS were selected. Each behavioral anchor had three components representing a behavioral example from each of the three exercises (such a rating format is considered a hybrid "task-BARS" approach; Harvey, 1991).

Comparison scores were developed for each of the assessees from the ratings of the research team. These comparison scores were used in computing the accuracy components. The expert assessors received extensive FOR training prior to independently rating each assessee (as recommended by Sulsky & Balzer, 1988).

Initial consensus was estimated at .86 (intraclass correlation). Experts then discussed their ratings to arrive at consensus comparison scores for each assessee on each dimension in every exercise.


Procedure

Assessors were randomly assigned in groups of 4 to 6 to receive control or FOR training. They were given basic knowledge of the design and uses of an AC and then received either control or FOR training (90 min) before rating the target assessees (90 min).

FOR training. Both the FOR and control training conditions included those practices common to assessor training programs in most ACs (Spychalski et al., 1997), with the FOR training including additional components. FOR assessors received a description of the three exercises in the AC, including a videotaped demonstration. Copies of the dimension definitions and BARS were distributed, and the trainer read aloud the dimension definitions and scale anchors, including the different ways in which the dimension could be manifested across exercises. Behaviors pertaining to different effectiveness levels for each dimension were also discussed by the trainer. The overall goal of this training was to create a common performance theory (i.e., frame of reference) among assessors so that they would agree on the standards used to evaluate assessees' behaviors. Assessors then practiced rating 4 assessees with their new frame of reference. The assessors were instructed to evaluate each assessee after each exercise on each performance dimension. After each exercise, assessors were asked to share their ratings with the rest of the group; the trainer led a discussion about the assigned ratings, clarifying discrepancies among raters and providing feedback on the appropriate effectiveness level on each dimension portrayed by each assessee. After training, assessors viewed the videotape of the 5 or 6 target assessees and, after each exercise, independently provided ratings on each dimension by using the BARS.

Control training. The introduction to the design and uses of ACs was lengthier in the control condition than in the FOR condition to equate overall training session time between the two conditions. The assessors in the control training condition also received a complete description of the three exercises in the AC, saw the videotaped demonstration of the exercises, and were given the dimension definitions and BARS to examine. Control assessors also viewed and rated the practice assessees, but their ratings were not discussed and no feedback was provided. Instead, the demonstration of the exercises and the practice ratings were used to emphasize the importance of accurately observing and recalling behavior. After viewing the practice tapes, the control assessors watched the target videotape of the 5 or 6 assessees and provided ratings for each assessee on each dimension after each exercise. We should note that each assessee's performance was rated by both a FOR-trained group and a control group.

Criterion Variables

For each assessee, there were within-exercise dimension ratings, across-exercise dimension ratings, and a mechanically derived overall assessment rating (OAR; a unit-weighted composite of the across-exercise dimension ratings). In addition to examining relationships within and between exercises and dimensions (i.e., convergent and discriminant relationships based on within-exercise dimension ratings), other relationships included those between the ratings and the comparison scores (i.e., accuracy relationships based on across-exercise dimension ratings), those between the ratings and other personality and ability measures (based on across-exercise dimension ratings), and those with job performance criterion variables in a predictive design (based on the OAR).

Accuracy. Five indexes of accuracy were computed. Cronbach's (1955) four component indexes include (a) elevation (the differential grand mean), (b) differential elevation (the differential main effect of ratees), (c) stereotype accuracy (the differential main effect of dimensions), and (d) differential accuracy (the differential Ratee × Dimension interaction).
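For concreteness, one standard way of writing these four components (following the general presentation in Cronbach, 1955, and Sulsky & Balzer, 1988) is sketched below. The notation is ours, and the study's exact computational variant is not reproduced here; x_ij denotes a rater's rating of ratee i (i = 1, ..., n) on dimension j (j = 1, ..., k), and t_ij denotes the corresponding comparison (true) score:

\begin{align*}
\text{Elevation}^2 &= (\bar{x}_{\cdot\cdot} - \bar{t}_{\cdot\cdot})^2\\
\text{Differential elevation}^2 &= \frac{1}{n}\sum_{i=1}^{n}\bigl[(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot}) - (\bar{t}_{i\cdot} - \bar{t}_{\cdot\cdot})\bigr]^2\\
\text{Stereotype accuracy}^2 &= \frac{1}{k}\sum_{j=1}^{k}\bigl[(\bar{x}_{\cdot j} - \bar{x}_{\cdot\cdot}) - (\bar{t}_{\cdot j} - \bar{t}_{\cdot\cdot})\bigr]^2\\
\text{Differential accuracy}^2 &= \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl[(x_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x}_{\cdot\cdot}) - (t_{ij} - \bar{t}_{i\cdot} - \bar{t}_{\cdot j} + \bar{t}_{\cdot\cdot})\bigr]^2
\end{align*}

Bars denote means over the dotted subscripts. Under this formulation, smaller values indicate greater accuracy, which matches the direction noted for these components in Table 6.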


Borman's (1977) differential accuracy is the average of the z-transformed correlations between a rater's ratings for each dimension and the corresponding true scores across ratees. Sulsky and Balzer (1988) provide a more complete description of these accuracy indexes.

Reliability. The interrater reliability for ratings of each assessee on each dimension was computed for all groups of assessors. In accordance with Schmitt (1977) and Schneider and Schmitt (1992), assessors in each group were treated as items (and assessees were treated as cases), enabling the calculation of internal consistency (i.e., coefficient α) estimates for each dimension. Confidence intervals were constructed around these reliability estimates (McGraw & Wong, 1996), allowing for comparison between FOR and control groups' interrater reliability. Because longer scales (i.e., more assessors) lead to greater internal consistency when all other things are equal, some assessors in each group were randomly excluded from the reliability analyses, resulting in equal numbers of FOR and control-trained assessors for each videotape.

Personality and skills. The Self-Monitoring Scale (Snyder & Gangestad, 1986) is an 18-item true–false instrument designed to assess the extent to which people "monitor (observe, regulate, and control) the public appearances of self they display in social situations and interpersonal relationships" (Snyder, 1987, pp. 4–5). Although the Self-Monitoring Scale has its critics on issues such as dimensionality (e.g., Briggs & Cheek, 1988), a recent study by Gangestad and Snyder (2000) showed that self-monitoring is a unitary construct with regard to its relations with such external criteria as expressive control, nonverbal decoding skills, and behavioral variability. In conjunction with research demonstrating that high self-monitors are more likely to emerge as leaders (Day, Schleicher, Unckless, & Hiller, 2002) and have stronger interpersonal and communication skills (e.g., Sypher & Sypher, 1983), these relationships illustrate the relevance for examining expected convergent relationships. Coefficient alpha for the Self-Monitoring Scale was .71 in the present study.

The Social Skills Inventory (Riggio, 1989) consists of 90 items rated on a 5-point Likert-type scale. A high score on this scale indicates a high degree of competence in social situations. The Social Skills Inventory demonstrated convergent and discriminant validity in a series of studies reported by Riggio (1986). Specifically (and relevant to its use in the present study), the Social Skills Inventory correlated in predicted directions with the Affective Communication Test (Friedman, Prince, Riggio, & DiMatteo, 1980) and the Profile of Nonverbal Sensitivity (Rosenthal, Hall, DiMatteo, Rogers, & Archer, 1979). Reliability (Cronbach's α) for this scale was .88.

Finally, the need for dominance—referring to an individual's attempts to control the environment, influence others, and assume leadership roles—was assessed via the Manifest Needs Questionnaire (Steers & Braunstein, 1976). This scale consists of 20 items (5 items per subscale; only the Need for Dominance subscale was included in the present hypotheses) with a 7-point Likert-type response format. The reliability estimate for the dominance subscale was .71 (coefficient α).
Supervisor ratings. The survey mailed to assessees included several measures that were to be given to the assessees' immediate supervisors or managers to be filled out and mailed directly back to the researcher Deidra J. Schleicher. Three of these measures were the same 7-point BARS used by assessors, with the behavioral anchors modified so that they were tied less to the AC setting. In addition, the supervisors also completed two other measures of performance: an overall evaluation of the employee and an assessment of the employee's future potential (both scaled from 1 to 7). These five ratings were combined to create an overall index of job performance (α = .89).

Results and Discussion

Intercorrelations and descriptive statistics for the assessee variables are reported in Table 1. Table 2 reports intercorrelations and descriptive statistics for the assessor variables. These same intercorrelations are reported separately for FOR and control assessors in Table 3.

SCHLEICHER, DAY, MAYES, AND RIGGIO

740

Table 1
Intercorrelations, Means, and Standard Deviations for the Assessee Variables

Variable                    M      SD      1       2       3       4       5       6       7       8       9
1. Gender^a                1.5     0.5     —
2. Age                    29.0     7.7    .29*     —
3. Tenure^b               32.0    26.3    .05     .69**    —
4. Performance^c           5.7     0.9    .08    −.11    −.05      —
5. Self-monitoring         9.1     3.4   −.37**  −.11     .00    −.18      —
6. Social skills         278.0    28.1   −.25    −.04     .07     .22     .57**    —
7. Need for dominance     21.6     4.4   −.41**   .05     .21     .19     .45**   .39**    —
8. FOR rating^d            4.4     1.0   −.37*   −.12     .07     .31*    .34*    .24     .40**    —
9. Control rating^e        4.0     1.3   −.32*   −.05     .07     .21     .31*    .14     .39**   .93**    —

Note. Ns range from 40 to 75.
^a 1 = male, 2 = female. ^b In months. ^c Average of three dimension ratings, overall performance, and potential. ^d The overall assessment center rating (on a scale from 1 to 7) given by assessors receiving frame-of-reference (FOR) training. ^e The overall assessment center rating (on a scale from 1 to 7) given by assessors receiving control training.
* p ≤ .05. ** p ≤ .01.

Interrater Reliability

The first six hypotheses examined the effect of FOR training on the quality of assessors' ratings. Hypothesis 1 stated that the interrater reliability would be higher for the FOR-trained assessors than for the control-trained assessors. Internal consistency (i.e., coefficient α) estimates for each dimension were as follows: communication skills, .83 (FOR) and .76 (control); decision-making skills, .89 (FOR) and .69 (control); and leadership skills, .75 (FOR) and .65 (control). Across the three dimensions, the average reliability for the FOR ratings was .82 (coefficient α), compared with .70 for the control ratings. These results reveal that the AC ratings overall had quite high levels of interrater reliability (consistent with previous research; e.g., Klimoski & Brickner, 1987) and that the FOR ratings were consistently more reliable than the control ratings. Moreover, for all four comparisons, the confidence intervals around the reliability estimates were nonoverlapping (McGraw & Wong, 1996), indicating that the FOR ratings were significantly more reliable than the control ratings. Thus, Hypothesis 1 was supported. Given that the goal of FOR training is to teach all of the assessors to view performance with a common frame of reference, such a finding is not surprising, although it is an important one. However, as always, reliability information is most appropriately viewed as a necessary, but insufficient, condition for claims of construct validity.
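As an illustration of the reliability computation described in the Method section (assessors treated as items and assessees as cases), a minimal sketch follows. The array shape and variable names are our own assumptions, and the example data are hypothetical; the confidence intervals reported above (McGraw & Wong, 1996) are not implemented here.

import numpy as np

def coefficient_alpha(ratings):
    """Cronbach's alpha with assessees as rows (cases) and assessors as columns (items).

    ratings: 2-D array, shape (n_assessees, n_assessors), for a single dimension.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                         # number of assessors ("items")
    item_vars = ratings.var(axis=0, ddof=1)      # variance of each assessor's ratings
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of the summed ratings
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 6 assessees rated by 4 assessors on one dimension.
ratings = np.array([
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [2, 3, 2, 2],
    [6, 5, 6, 6],
    [4, 3, 4, 4],
])
print(round(coefficient_alpha(ratings), 2))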

Mono- and Heterotrait and Method Correlations

Hypotheses 2a and 2b predicted that the heterotrait–monomethod correlations and the heterotrait–heteromethod correlations would be lower for FOR assessors than for control assessors, respectively. Correlations between ratings on the same and different exercises and dimensions are reported in Table 4, for both the FOR and control assessors. To test the hypotheses, z transformations were performed on these individual correlations, and the resulting coefficients were averaged to obtain measures of convergent and discriminant validity. Although not part of the hypotheses, the average convergent validity coefficient (i.e., monotrait–heteromethod correlation) was .34 for FOR assessors and .48 for control assessors (both ps ≤ .05). A t test for significant differences between nonindependent correlations (Steiger, 1980) revealed that this was a significant difference, t(16) = −2.14, p ≤ .05 (two-tailed; R² = .246); the monotrait–heteromethod correlations were higher for control assessors than for FOR assessors.
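A minimal sketch of the averaging step described above (Fisher's r-to-z transformation, averaging, and back-transformation). The function and variable names are ours, and the values shown are only an illustrative subset of correlations, not a full set from Table 4.

import numpy as np

def average_correlation(rs):
    """Average a set of correlations via Fisher's r-to-z transformation."""
    z = np.arctanh(np.asarray(rs, dtype=float))  # r-to-z
    return float(np.tanh(z.mean()))              # back-transform the mean z

# Illustrative heterotrait-monomethod correlations for one group of assessors.
htmm = [.64, .63, .73, .54, .68, .63]
print(round(average_correlation(htmm), 2))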

Table 2
Intercorrelations, Means, and Standard Deviations for the Assessor Variables

Variable                                M     SD      1       2       3       4       5       6       7       8       9      10
1. Gender^a                            1.5    0.5     —
2. Age                                22.4    4.5    .05      —
3. Student^b                           2.0    0.2    .04    −.19*     —
4. Managerial experience               1.8    0.4    .21*   −.37**   .16      —
5. Performance appraisal experience    1.7    0.5    .22*   −.31**   .13     .43**    —
6. Assessment center experience        2.0    0.2    .06    −.19*    .21*    .23**   .10      —
7. Elevation                           0.7    0.9   −.13    −.11    −.08     .05     .03     .00      —
8. Differential elevation              0.6    0.2   −.13     .05    −.01    −.07    −.15     .16     .02      —
9. Differential accuracy               0.4    0.1    .05     .02    −.15     .02     .01     .01     .18*   −.04      —
10. Stereotype accuracy                0.3    0.2   −.03     .03     .02    −.09    −.12    −.18*    .10    −.03     .12      —
11. Borman's differential accuracy     1.0    0.5    .13    −.09     .22*    .09     .02    −.10    −.09    −.28**  −.37**  −.18*

Note. N = 122. For managerial, performance appraisal, and assessment center experience, 1 = yes and 2 = no.
^a 1 = male, 2 = female. ^b 1 = master of business administration students, 2 = psychology students.
* p ≤ .05. ** p ≤ .01.


Table 3
Intercorrelations for the Assessor Variables

Variable                               1       2       3       4       5       6       7       8       9      10      11
1. Gender                              —     −.12    −.04     .28*    .36**   .09    −.12    −.13     .05    −.08     .22
2. Age                                .11      —     −.23    −.14    −.25    −.02    −.09     .28*    .04     .02    −.27*
3. Student^a                          .13    −.23      —      .09     .09     .32*   −.11     .13    −.14     .00     .28*
4. Managerial experience              .13    −.53**   .25*     —      .48**   .29*    .19     .05     .13     .00    −.05
5. Performance appraisal experience   .12    −.35**   .18     .41**    —      .26*    .01    −.13     .05    −.01    −.04
6. Assessment center experience       .01    −.37**  −.02     .14     .26*     —      .05     .26    −.02    −.22    −.09
7. Elevation                         −.07    −.13     .07    −.10     .01     .01      —     −.09     .20     .07    −.09
8. Differential elevation            −.07    −.03    −.18    −.15    −.13     .09    −.13      —     −.17     .00    −.23
9. Differential accuracy              .09     .04    −.15    −.07     .05     .10     .00    −.02      —      .05    −.27*
10. Stereotype accuracy               .06     .07     .07    −.21    −.01    −.08     .10    −.17     .21      —     −.21
11. Borman's differential accuracy    .01    −.03     .13     .21    −.04    −.15     .02    −.29*   −.46**  −.12      —

Note. Frame-of-reference assessors (n = 65) are below and control assessors (n = 57) are above the diagonal. For managerial, performance appraisal, and assessment center experience, 1 = yes and 2 = no.
^a 1 = master of business administration students, 2 = psychology students.
* p ≤ .05. ** p ≤ .01.

The average discriminant validity coefficient (i.e., heterotrait–monomethod correlations) was .66 for FOR assessors and .74 for control assessors. As we predicted (Hypothesis 2a), the heterotrait–monomethod correlations were significantly smaller for FOR assessors than for control assessors, t(16) = −2.54, p ≤ .05 (one-tailed; R² = .338). The heterotrait–heteromethod correlations were .28 and .48 for FOR and control assessors, respectively. This was a significant difference, t(34) = −5.62, p ≤ .01 (one-tailed; R² = .497), and thus, Hypothesis 2b was also supported.

The support for Hypotheses 2a and 2b, in conjunction with the magnitude of the monotrait–heteromethod correlations, suggests an intriguing conclusion. For all three types of correlations, control raters had larger correlations than FOR raters. This finding suggests that control assessors did not discriminate between dimensions as effectively as FOR assessors. This lack of discrimination is dramatically illustrated by the sizable magnitude of the heterotrait–heteromethod correlation for control assessors (.48), which represents the average correlation between ratings of different traits in different exercises. Although it may not be realistic to expect FOR training to increase the monotrait–heteromethod correlations, it does appear that FOR training can improve assessors' discrimination between dimensions, thereby enhancing the discriminant validity component of construct validity. Furthermore, it is this evidence of discriminant validity that has proven most elusive to past AC researchers.

Convergence With Personality Measures and Supervisors' Dimension Ratings

We hypothesized (Hypothesis 3) that, compared with control assessors, FOR-trained assessors would show stronger correlations between their leadership ratings and (a) self-monitoring, (b) social skills, and (c) need for dominance. These correlations are shown in Table 5.

Table 4
Mono- and Heterotrait and Method Correlations for Frame-of-Reference (FOR) and Control Assessors—Hypothesis 2

                                Leaderless group discussion       Presentation                    Interview
Dimensions within exercises       C       D       L            C       D       L            C       D       L
Leaderless group discussion
  C                               —     .80 HM  .77 HM       .38 MH  .38 HH  .40 HH       .51 MH  .49 HH  .45 HH
  D                             .64 HM    —     .83 HM       .38 HH  .39 MH  .43 HH       .41 HH  .52 MH  .46 HH
  L                             .63 HM  .73 HM    —          .46 HH  .36 HH  .39 MH       .48 HH  .50 HH  .48 MH
Presentation
  C                             .31 MH  .10 HH  .13 HH         —     .60 HM  .67 HM       .52 MH  .51 HH  .46 HH
  D                             .22 HH  .18 MH  .15 HH       .54 HM    —     .76 HM       .48 HH  .57 MH  .45 HH
  L                             .19 HH  .18 HH  .25 MH       .68 HM  .63 HM    —          .52 HH  .58 HH  .52 MH
Interview
  C                             .42 MH  .38 HH  .43 HH       .40 MH  .27 HH  .25 HH         —     .81 HM  .71 HM
  D                             .37 HH  .40 MH  .38 HH       .32 HH  .28 MH  .40 HH       .73 HM    —     .71 HM
  L                             .44 HH  .33 HH  .44 MH       .27 HH  .27 HH  .35 MH       .69 HM  .64 HM    —

Note. FOR raters are below the diagonal, and control raters are above the diagonal. Individual correlations are based on an N = 122. C = communication skills; D = decision-making skills; L = leadership skills; HM = heterotrait–monomethod correlations; MH = monotrait–heteromethod correlations; HH = heterotrait–heteromethod correlations.


Results from t tests for significant differences between two nonindependent correlations support Hypothesis 3b (social skills: r = .28 for FOR ratings and r = .14 for control ratings), t(55) = 2.38, p < .05 (one-tailed; R² = .097). Results for Hypothesis 3a (self-monitoring) and Hypothesis 3c (need for dominance) were in the hypothesized direction, but these were not statistically significant (ps > .05, one-tailed; R²s = .007 and .009, respectively).

We also hypothesized (Hypothesis 4) that, compared with control assessors, there would be stronger correlations between the communication skills ratings provided by FOR raters and (a) self-monitoring and (b) social skills personality variables. These results are presented in Table 5. Results for Hypothesis 4a (self-monitoring) and Hypothesis 4b (social skills) were in the hypothesized direction but were not statistically significant (self-monitoring: r = .32 for FOR ratings and r = .27 for control ratings), p > .05, one-tailed, R² = .013 (social skills: r = .22 for FOR ratings and r = .13 for control ratings), t(55) = 1.34, p = .08 (one-tailed; R² = .033). In sum, support for the personality hypotheses was somewhat weak. In all five of the hypothesized relationships for the personality variables, the pattern of the FOR and control correlations was in the predicted direction, but only one of these differences was significant.

Perhaps the most relevant convergent measures available in the present study are the supervisors' ratings of leadership, communication skills, and decision-making skills. These measures are essentially an assessment of the same dimensions as were evaluated in the AC, by different people, at a different time, and in a different setting. As such, these measures are also seen as convergent measures in a construct validity framework. An examination of these relationships did indeed reveal convergence. Specifically, Hypothesis 5 predicted that the dimension ratings for FOR-trained assessors, as compared with those for control-trained assessors, would be more highly correlated with the corresponding dimension ratings as provided by the supervisor. As Table 5 shows, this hypothesis was supported for two of the three dimensions.

Table 5
Convergent Correlations Between Assessment Center (AC) Ratings and Personality Variables and Supervisor Dimension Ratings—Hypotheses 3–5

Measure                                        FOR     Control    t^a      R²
Leadership AC ratings
  Self-monitoring                              .36*     .32*      0.62    .007
  Social skills                                .28*     .14       2.38*   .097
  Need for dominance                           .44**    .40**     0.70    .009
  Supervisor leadership ratings                .42**    .27       2.15*   .117
Communication skills AC ratings
  Self-monitoring                              .32*     .27       0.84    .013
  Social skills                                .22      .13       1.34    .033
  Supervisor communication skills ratings      .12      .00       1.80*   .085
Decision-making skills AC ratings
  Supervisor decision-making skills ratings    .26      .28      −0.28    .002

Note. N = 40–58. FOR = frame of reference.
^a One-tailed test for significant differences between two nonindependent correlations.
* p ≤ .05. ** p ≤ .01.

The correlations between the assessment dimension ratings and the corresponding supervisor dimension ratings were .12 (FOR) versus .00 (control) for communication skills, .26 (FOR) versus .28 (control) for decision-making skills, and .42 (FOR) versus .27 (control) for leadership skills. For communication and leadership skills, the FOR correlation was significantly (p < .05) higher than the control correlation (see Table 5 for statistical test results). Given the pronounced differences across these two measurement episodes, these findings indicate relatively high levels of convergence (with the exception of the communication skills dimension) and suggest that the failure to obtain significant results with the personality measures may be more a function of the personality variables available than it is evidence for the lack of convergent validity of the AC ratings (particularly the FOR ratings).

Accuracy

The final hypothesis bearing on the strength of Inference 1 predicted that FOR-trained assessors would make more accurate ratings than would control assessors (Hypothesis 6). Given the conceptual overlap of the five accuracy indexes and their statistically significant intercorrelations (see Table 2), this hypothesis was tested via multivariate analysis of variance, with training (FOR vs. control) as the independent variable and the five accuracy indexes as the multiple dependent variables. As we predicted, ratings provided by FOR-trained assessors were significantly more accurate than those made by control assessors, F(5, 94) = 8.10, p < .001, Wilks's Λ = .70; partial η² = .242. Follow-up discriminant function analyses revealed that this effect was driven primarily by elevation, differential elevation, and differential accuracy, with standardized discriminant function coefficients of −.73, −.75, and −.45, respectively. As a means of estimating the effect size associated with each accuracy dependent variable, we also computed univariate analyses of variance (ANOVAs). Table 6 reports the results of these ANOVAs along with the accuracy means for the FOR and control groups. Not surprisingly, results from these individual ANOVAs parallel those for the follow-up discriminant function analyses.

The current results regarding accuracy are in line with robust findings showing that FOR training is the best approach for improving rating accuracy (see Woehr & Huffcutt, 1994) and extend the effectiveness of FOR training from the performance appraisal setting to the AC setting. Unlike typical performance appraisal situations in which political motivations often color the ratings (Longenecker, Sims, & Gioia, 1987), there are unlikely to be political motivations involved in ACs, given that assessors typically neither know the assessees (Spychalski et al., 1997) nor will be accountable to them for their ratings. Therefore, one must assume that any exercise effects originate from an inability to distinguish between the dimensions rather than any motivational reasons on the part of the assessors, making accuracy an important criterion in ACs.
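A minimal sketch of this general analysis strategy (one omnibus MANOVA across the five accuracy indexes, followed by univariate ANOVAs), using statsmodels. The data file and column names are hypothetical stand-ins for the assessor-level accuracy scores; this is not the authors' analysis code.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA

# Hypothetical assessor-level data: one row per assessor, five accuracy
# scores plus the training condition (FOR vs. control).
df = pd.read_csv("accuracy_scores.csv")  # columns: training, elevation, diff_elev, diff_acc, stereo_acc, borman

# Omnibus test: training condition predicting the five accuracy indexes jointly.
manova = MANOVA.from_formula(
    "elevation + diff_elev + diff_acc + stereo_acc + borman ~ training", data=df
)
print(manova.mv_test())  # includes Wilks's lambda

# Follow-up univariate ANOVAs, one per accuracy index.
for dv in ["elevation", "diff_elev", "diff_acc", "stereo_acc", "borman"]:
    fit = ols(f"{dv} ~ training", data=df).fit()
    print(dv)
    print(sm.stats.anova_lm(fit, typ=2))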

Criterial Relationships

No previous studies that have examined variables thought to improve the convergent and discriminant validity of assessment ratings have also assessed to what extent criterion-related validities are similarly affected (Klimoski & Brickner, 1987). This is likely due to the difficulty of having the degree of control required to manipulate factors believed to affect construct validity and yet having real-world outcomes available as criteria.


Table 6
Analysis of Variance Results for the Five Accuracy Components—Hypothesis 6

Accuracy                           FOR     Control      F        p       R²
Elevation                         0.36      1.02      17.31    <.001    .14
Differential elevation            0.53      0.67      11.55    <.001    .09
Differential accuracy             0.42      0.45       3.81     <.05    .03
Stereotype accuracy               0.27      0.30       0.40     >.05    .01
Borman's differential accuracy    1.03      0.91       0.01     >.05    .00

Note. For elevation, differential elevation, differential accuracy, and stereotype accuracy, smaller numbers represent greater accuracy. For Borman's differential accuracy, larger numbers represent greater accuracy. FOR = frame of reference.

The present study provided a unique opportunity for examining both of these issues. Hypothesis 7 predicted that there would be greater criterion-related validity with the FOR ratings than with the control ratings. The criterion of interest was supervisory job performance ratings. The criterion-related validity coefficient for the FOR ratings was .32 (p < .05), compared with .21 (p > .05) for the control ratings. These two correlations were significantly different, t(40) = 1.95, p < .05 (one-tailed; R² = .091), indicating that the FOR ratings were significantly more predictive of performance than the control ratings, thus supporting Hypothesis 7. Supplemental analyses were undertaken to examine whether the AC ratings provided incremental validity beyond college grade point average and American Assembly of Collegiate Schools of Business scores (a standardized exam assessing knowledge of core business areas). The results indicated that whereas FOR ratings accounted for a significant (p < .05) amount of unique variance in performance over these other predictors, the control ratings did not (complete results for these analyses are reported in Table 7).

We noted above that the improved criterion-related validity of FOR-generated ratings was likely due to their better construct validity (e.g., their superior reliability, accuracy, and convergent–discriminant validity). Unfortunately, there is no way to directly test the hypothesized mediating role of these variables in the present study. However, indirect evidence for the importance of accuracy comes from examining the criterion-related validity associated with the comparison score ratings. The comparison scores were slightly more predictive of the job performance ratings than were the FOR ratings (.36 vs. .32, respectively), implying that it may be their similarity to comparison scores (i.e., accuracy) that was responsible for the improved criterion-related validity of the FOR ratings.

Interestingly, two separate aspects of the results suggest that the effectiveness of FOR training may lie in more finely tuned calibration of the assessors on the dimensions rather than in merely teaching the correct rank order of assessees on dimensions. First, Borman's (1977) differential accuracy index, which provides only correlational information, was not improved as a result of FOR training, but the Cronbach accuracy components, which provide distance information, were. Second was the high correlation (r = .93) we found between the FOR and control ratings. Although this finding may initially lead one to question the significance of a difference between FOR and control ratings, it should be noted that this correlation represents only rank-order association.

743

candidates regardless of training. However, the other differences between FOR and control ratings (e.g., mean differences in distance accuracy and convergent and discriminant relations) become extremely important when ACs are used for training and development purposes. The criterion-related validity coefficients obtained in the present study, particularly the FOR-based coefficients, approach the magnitude of the meta-analytic coefficients found in the literature (.37, corrected [Gaugler et al., 1987]; .43 [Hunter & Hunter, 1984]; and .41 [Schmitt et al., 1984]). As such, the present study provides further support for the predictive validity of ACs. These findings are particularly good news for this specific AC, given that there were many factors in the present study (e.g., data were collected across multiple organizations, no direct or indirect criterion contamination; Klimoski & Brickner, 1987) that should, if anything, lead to a conservative estimate of the criterion-related validity.
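Because the FOR and control validity coefficients were computed on the same assessees and share the same criterion (the supervisory performance ratings), testing their difference calls for a test of dependent correlations. One common choice is Williams's t, the test recommended by Steiger (1980). The sketch below illustrates that form of the test using the rounded values reported above (r = .32, .21, and .93) and assumes n = 43 so that the degrees of freedom match the 40 reported in the text; it is a plausible reconstruction under those assumptions, not a reproduction of the authors' exact analysis.

```python
import math

def williams_t(r_jk, r_jh, r_kh, n):
    """Williams's t for comparing two dependent correlations r_jk and r_jh
    that share variable j (see Steiger, 1980). Returns (t, df) with df = n - 3."""
    det = 1 - r_jk**2 - r_jh**2 - r_kh**2 + 2 * r_jk * r_jh * r_kh
    r_bar = (r_jk + r_jh) / 2
    t = (r_jk - r_jh) * math.sqrt(
        ((n - 1) * (1 + r_kh))
        / (2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r_kh) ** 3)
    )
    return t, n - 3

# Rounded values from the text: r(FOR, performance) = .32,
# r(control, performance) = .21, r(FOR, control) = .93.
# n = 43 is an assumption chosen to give the reported df of 40.
t, df = williams_t(0.32, 0.21, 0.93, 43)
print(f"t({df}) = {t:.2f}")  # about 2.0 with these rounded inputs
```

With these rounded inputs the statistic comes out near 2.0 rather than exactly 1.95, which is the kind of discrepancy one would expect from rounding the correlations to two decimals.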

Conclusion and Future Directions

In summary, the present study examined many different types of evidence to demonstrate the effectiveness of FOR training in improving the construct and criterion-related validity of ACs. Results showed that FOR training, as compared with control training, improved the reliability and accuracy of AC ratings. The FOR ratings also showed improved discriminant validity, in the form of smaller heterotrait–monomethod and heterotrait–heteromethod correlations, and somewhat improved convergent validity, in the form of larger correlations with external measures of the same and similar constructs. Finally, FOR training significantly improved the criterion-related validity of the current AC for predicting supervisors' ratings of job performance. Implications for both FOR training and AC practice are discussed below, after a review of some possible limitations of the current study.

Limitations of the Present Study

The two primary methodological goals in the present study were (a) to ensure the ability to make strong internal validity inferences regarding the effect of FOR training and (b) to eliminate most, if not all, potential contaminants (i.e., confounds) that typically exist in AC validity studies (e.g., direct and indirect criterion contamination; Klimoski & Brickner, 1987). In light of these goals, we address several specific generalizability concerns. First, because of the low response rate from the survey, potential response biases are of concern.

Table 7
Regression Results for the Incremental Validity of the Assessment Center (AC) Ratings

1st variable block       R²    2nd variable block      R²    ΔR²     p
GPA^a, AACSB score      .05    FOR AC ratings         .24    .19*   .037
GPA, AACSB score        .05    Control AC ratings     .11    .07    .240

Note. GPA = grade point average; AACSB = American Assembly of Collegiate Schools of Business content knowledge exam; FOR = frame of reference.
^a Collegiate GPA at time of AC participation.
* p ≤ .05.
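As background for how increments of the kind shown in Table 7 are typically evaluated, the sketch below fits two nested ordinary least squares models and tests the change in R² with the usual incremental F test. The variable names and the simulated data are placeholders, not the study's data or its exact analysis.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

def incremental_r2(X_base, X_full, y):
    """Delta R^2 of the full over the base model, with the incremental F test."""
    n = len(y)
    r2_base, r2_full = r_squared(X_base, y), r_squared(X_full, y)
    df_add = X_full.shape[1] - X_base.shape[1]      # predictors added in block 2
    df_resid = n - X_full.shape[1] - 1              # residual df, full model
    f = ((r2_full - r2_base) / df_add) / ((1 - r2_full) / df_resid)
    return r2_full - r2_base, f

# Placeholder data: block 1 = GPA and AACSB score, block 2 adds an AC rating.
rng = np.random.default_rng(0)
n = 50
gpa, aacsb, ac_rating = rng.normal(size=(3, n))
performance = 0.3 * ac_rating + rng.normal(size=n)
delta_r2, f_inc = incremental_r2(
    np.column_stack([gpa, aacsb]),
    np.column_stack([gpa, aacsb, ac_rating]),
    performance,
)
print(f"Delta R^2 = {delta_r2:.3f}, F = {f_inc:.2f}")
```

In practice the actual GPA, AACSB, and AC rating columns would replace the simulated ones, and the same quantities are available from any standard regression package.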

Although there were no differences between respondents and nonrespondents in terms of age, gender, ethnicity, or any of the personality variables, this is a limited set of variables on which possible differences were assessed. Accordingly, there may be representativeness issues with our sample; this possible limitation should be addressed in future studies of this type.

Second, two of the AC exercises were truncated to 5 min. Although the impression formation literature suggests that assessors' ratings of performance are unlikely to change after the first 5 min (e.g., Ambady & Rosenthal, 1992), this decision presents possible generalizability concerns. It also suggests that it may be interesting for future research to examine whether FOR training mitigates this tendency to form impressions early and not revise them in light of additional behavior.

Finally, there may be concerns arising from the use of students as both assessees and assessors. With regard to the former, for a variety of reasons (e.g., mean age, business student status, previous work experience, industries entered), the current assessees may be precisely the candidates likely to be involved in most ACs used in industry (with the exception of those designed for executive-level assessment). The issue of student assessors is admittedly more challenging. However, it is not clear who should have been chosen as "assessor" subjects (i.e., very few people hold the job title of assessor). Instead, special effort was made to recruit as assessors subjects who had managerial experience, so as to improve this aspect of generalizability.

Implications and Future Directions for FOR Training

Previous research on FOR training offers several possible explanations for its effectiveness in the AC context in the present study, including a reduction in cognitive load. For example, Day and Sulsky (1995) suggested that FOR training helps raters form more accurate on-line impressions, reducing the need to recall behaviors for each assessee in each exercise and then categorize those behaviors into relevant dimensions. This FOR effect would likely be even larger with a greater number of performance dimensions than were used in the present study. That is, control raters have a better chance of keeping 3 dimensions straight than 8 or 10. As the number of assessment dimensions increases, so might the advantages associated with FOR training.

It has also been noted that the AC may provide such a strong, exercise-based organizational framework that assessors are forced to code behavior by exercise rather than by dimension. Strong manipulations may be necessary to modify these schemas and allow assessors to successfully reorganize information by dimension (Harris et al., 1993; Silverman et al., 1986). FOR training represents one such manipulation. Research has shown that FOR training shifts people's impression formation process toward more trait-based (as opposed to behavior-based) representations (Schleicher & Day, 1998). We assumed that this feature of FOR training would reduce exercise effects in the present study, and the results suggest that it did have some impact in this regard. Moreover, supplemental exploratory analyses on the present data revealed that impressions formed by FOR-trained assessors (coded from a free-recall task) were significantly more target referent and significantly more abstract than those formed by control-trained assessors (Carlston, 1994). In other words, FOR assessors' impressions primarily used categories and traits (i.e., assessment dimensions), as opposed to behaviors, as the main representational forms. Future research should further pursue this proposed mediating role of assessors' impression formation processes.

Implications and Future Directions for ACs

The present results regarding the effectiveness of FOR training also have implications for AC practice. They suggest that FOR training should be useful in practice because it is no more lengthy or expensive than control training, enhances the legal defensibility of the AC (Werner & Bolino, 1997), and appears to enhance AC validity. That is, previous research has established FOR training as effective in improving rating accuracy (e.g., Woehr & Huffcutt, 1994), its relationship to components of procedural justice has been noted (see Schleicher & Day, 1998), and the preliminary evidence presented here suggests that it improves both the construct and criterion-related validity of AC ratings.

FOR training represents a fundamentally different approach to training assessors than what has typically been used. For example, many AC training programs focus on the accuracy of the observation and recall of behavior (see Byham, 1977; Spychalski et al., 1997). FOR training, in contrast, focuses on the accuracy of impressions formed on the various performance dimensions. In fact, FOR training often leads to less accurate recall of behavior while simultaneously enhancing the accuracy of impressions (Sulsky & Day, 1992). The point of this distinction is to assure the AC practitioner that the present study does more than show that training assessors can enhance both the construct and criterion-related validity of assessment ratings; it also suggests that the best approach to training may be one that is fundamentally different from the typical approach.

An important, albeit implicit, assumption that guided the present study was that there is no true exercise effect (Neidig & Neidig, 1984); that is, dimensions, not exercises, underlie AC ratings. This point is reminiscent of the sign-versus-sample distinction (Wernimont & Campbell, 1968), with its corresponding questions of how to view the AC and thus how to validate it. The present position is that the AC is a sign, not merely a work sample, in that it offers an opportunity to observe where assessees fall on dimensions that may be important for future performance. In fact, the criterion-related validity coefficients are strongly supportive of such a position. That is, dimension ratings in an educational AC predicted job performance 4 years later across several different jobs and organizations. It is not reasonable to assume that these exercises met the requirements for being considered work samples of the jobs now held by these assessees. Therefore, the basis for the criterion-related validity of the present AC must be an accurate evaluation of assessees on dimensions believed to be important for job performance (i.e., a sign) and not merely a work-sample test.

The above discussion raises the question of whether it matters if we are measuring performance on dimensions (a sign) or performance on exercises (a sample), as long as the AC ratings predict performance on the job. There are several reasons to believe that this distinction is important and that paying attention to the dimensions included in ACs matters. First, as the present study suggests, using a dimension-focused approach to training can have a positive impact on the criterion-related validity of the AC. Second, ACs are increasingly being used for training and development purposes (Cohen, 1980; Spychalski et al., 1997), making the accurate assessment of personnel on construct-valid dimensions increasingly important. Finally, the emphasis on dimensions is crucial to the extent that the AC is designed to assess potential, and early identification of management potential is still one of the most common purposes for ACs (Gaugler et al., 1987). As Klimoski and Brickner (1987) noted, "In terms of priorities, given the real need of organizations to assess potential (apart from competencies), it would seem most important to establish if, or under what conditions, ACs can be made to produce valid measures of constructs" (p. 255). The present study preliminarily suggests that FOR training provides at least one of these conditions.


References

Ambady, N., & Rosenthal, R. (1992). Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin, 111, 256–274.
Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1–15.
Archambeau, D. J. (1979). Relationships among skill ratings assigned in an assessment center. Journal of Assessment Center Technology, 2, 7–20.
Athey, T. R., & McIntyre, R. M. (1987). Effect of rater training on rater accuracy: Levels-of-processing theory and social facilitation perspectives. Journal of Applied Psychology, 72, 567–572.
Bernardin, H. J., & Buckley, M. R. (1981). Strategies in rater training. Academy of Management Review, 6, 205–212.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494.
Borman, W. C. (1977). Consistency of rating accuracy and rating errors in the judgment of human performance. Organizational Behavior and Human Performance, 20, 238–252.
Brannick, M. T., Michaels, C. E., & Baker, D. P. (1989). Construct validity of in-basket scores. Journal of Applied Psychology, 74, 957–963.
Bray, D. W., & Grant, D. L. (1966). The assessment center in the measurement of potential for business management. Psychological Monographs: General & Applied, 80(17, Whole No. 625).
Briggs, S. R., & Cheek, J. M. (1988). On the nature of self-monitoring: Problems with assessment, problems with validity. Journal of Personality and Social Psychology, 54, 663–678.
Bycio, P., Alvares, K. M., & Hahn, J. (1987). Situational specificity in assessment center ratings: A confirmatory factor analysis. Journal of Applied Psychology, 72, 463–474.
Byham, W. C. (1977). Assessor selection and training. In J. L. Moses & W. C. Byham (Eds.), Applying the assessment center method (pp. 89–126). New York: Pergamon Press.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105.
Cardy, R. L., & Keefe, T. J. (1994). Observational purpose and evaluative articulation in frame-of-reference training: The effects of alternative processing modes on rating accuracy. Organizational Behavior and Human Decision Processes, 57, 338–357.
Carlston, D. E. (1994). Associated systems theory: A systematic approach to cognitive representations of persons. Advances in Social Cognition, 7, 1–78.
Cohen, S. L. (1980). The bottom line on assessment center technology. Personnel Administrator, 25, 50–56.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cronbach, L. J. (1955). Processes affecting scores on "understanding of others" and "assumed similarity." Psychological Bulletin, 52, 177–193.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Day, D. V., Schleicher, D. J., Unckless, A. L., & Hiller, N. J. (2002). Self-monitoring personality at work: A meta-analytic investigation of construct validity. Journal of Applied Psychology, 87, 390–401.
Day, D. V., & Sulsky, L. M. (1995). Effects of frame-of-reference training and information configuration on memory organization and rating accuracy. Journal of Applied Psychology, 80, 158–167.
Edwards, J. E., Thomas, M. D., Rosenfeld, P., & Booth-Kewley, S. (1997). How to conduct organizational surveys: A step-by-step guide. Thousand Oaks, CA: Sage.
Fleenor, J. W. (1996). Constructs and developmental assessment centers: Further troubling empirical findings. Journal of Business and Psychology, 10, 319–335.
Friedman, H. S., Prince, L. M., Riggio, R. E., & DiMatteo, M. R. (1980). Understanding and assessing nonverbal expressiveness: The Affective Communication Test. Journal of Personality and Social Psychology, 39, 333–351.
Gangestad, S. W., & Snyder, M. (2000). Self-monitoring: Appraisal and reappraisal. Psychological Bulletin, 126, 530–555.
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., III, & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology, 72, 493–511.
Harris, M. M., Becker, A. S., & Smith, D. E. (1993). Does the assessment center scoring method affect the cross-situational consistency of ratings? Journal of Applied Psychology, 78, 675–678.
Harvey, R. J. (1991). Job analysis. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., pp. 71–163). Palo Alto, CA: Consulting Psychologists Press.
Hunter, J. E., & Hunter, R. F. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98.
Ickes, W., & Barnes, R. D. (1977). The role of sex and self-monitoring in unstructured dyadic interactions. Journal of Personality and Social Psychology, 35, 315–330.
Klimoski, R. J., & Brickner, M. (1987). Why do assessment centers work? The puzzle of assessment center validity. Personnel Psychology, 40, 243–260.
Landy, F. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192.
Longenecker, C. O., Sims, H. P., & Gioia, D. A. (1987). Behind the mask: The politics of employee appraisal. Academy of Management Executive, 1, 183–193.
McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1, 30–46.
McIntyre, R. M., Smith, D. E., & Hassett, C. E. (1984). Accuracy of performance ratings as affected by rater training and purpose of rating. Journal of Applied Psychology, 69, 147–156.
Neidig, R. D., Martin, J. C., & Yates, R. E. (1979). The contribution of exercise skill ratings to final assessment center evaluations. Journal of Assessment Center Technology, 2, 21–23.
Neidig, R. D., & Neidig, P. J. (1984). Multiple assessment center exercises and job relatedness. Journal of Applied Psychology, 69, 182–186.
Norton, S. D. (1981). The assessment center process and content validity: A reply to Dreher and Sackett. Academy of Management Review, 6, 561–566.
Pulakos, E. D. (1984). A comparison of training programs: Error training and accuracy training. Journal of Applied Psychology, 69, 581–588.
Pulakos, E. D. (1986). The development of training programs to increase accuracy with different rating tasks. Organizational Behavior and Human Decision Processes, 38, 76–91.
Reilly, R. R., Henry, S., & Smither, J. W. (1990). An examination of the effects of using behavior checklists on the construct validity of assessment center dimensions. Personnel Psychology, 43, 71–84.
Riggio, R. E. (1986). Assessment of basic social skills. Journal of Personality and Social Psychology, 51, 649–660.
Riggio, R. E. (1989). Social Skills Inventory manual. Palo Alto, CA: Consulting Psychologists Press.
Riggio, R. E., Aguirre, M., Mayes, B. T., Belloli, C., & Kubiak, C. (1997). The use of assessment center methods for student outcome assessment. Journal of Social Behavior and Personality, 12, 1–15.
Robertson, I., Gratton, L., & Sharpley, D. (1987). The psychometric properties and design of assessment centres: Dimensions into exercises won't go. Journal of Occupational Psychology, 60, 187–195.
Rosenthal, R., Hall, J. A., DiMatteo, M. R., Rogers, P. L., & Archer, D. (1979). Sensitivity to nonverbal communications: The PONS test. Baltimore: Johns Hopkins University Press.
Sackett, P. R. (1987). Assessment centers and content validity: Some neglected issues. Personnel Psychology, 40, 13–25.
Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67, 401–410.
Schleicher, D. J., & Day, D. V. (1998). A cognitive evaluation of frame-of-reference training: Content and process issues. Organizational Behavior and Human Decision Processes, 73, 76–101.
Schmidt, F. L., Hunter, J. E., & Urry, V. W. (1976). Statistical power in criterion-related validation studies. Journal of Applied Psychology, 61, 473–485.
Schmitt, N. (1977). Interrater agreement in dimensionality and combination of assessment center judgments. Journal of Applied Psychology, 62, 171–176.
Schmitt, N., Gooding, R. Z., Noe, R. A., & Kirsch, M. (1984). Meta-analysis of validity studies published between 1964 and 1982 and the investigation of study characteristics. Personnel Psychology, 37, 407–422.
Schmitt, N., Schneider, J. R., & Cohen, S. A. (1990). Factors affecting validity of a regionally administered assessment center. Personnel Psychology, 43, 1–12.
Schneider, J. R., & Schmitt, N. (1992). An exercise design approach to understanding assessment center dimension and exercise constructs. Journal of Applied Psychology, 77, 32–41.
Shore, T. H., Thornton, G. C., & McFarlane Shore, L. (1990). Construct validity of two categories of assessment center ratings. Personnel Psychology, 42, 101–113.
Silverman, W. H., Dalessio, A., Woods, S. B., & Johnson, R. L., Jr. (1986). Influence of assessment center methods on assessors' ratings. Personnel Psychology, 39, 565–578.
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149–155.
Snyder, M. (1987). Public appearances/private realities: The psychology of self-monitoring. New York: Freeman.
Snyder, M., & Gangestad, S. (1986). On the nature of self-monitoring: Matters of assessment, matters of validity. Journal of Personality and Social Psychology, 51, 125–139.
Spychalski, A. C., Quinones, M. A., Gaugler, B. B., & Pohley, K. (1997). A survey of assessment center practices in organizations in the United States. Personnel Psychology, 50, 71–90.
Stamoulis, D. T., & Hauenstein, N. M. A. (1993). Rater training and rating accuracy: Training for dimensional accuracy versus training for ratee differentiation. Journal of Applied Psychology, 78, 994–1003.
Steers, R. M., & Braunstein, D. N. (1976). A behaviorally-based measure of manifest needs in work settings. Journal of Vocational Behavior, 9, 251–266.
Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245–251.
Stein, R. T., & Heller, T. (1979). An empirical analysis of the correlations between leadership status and participation rates reported in the literature. Journal of Personality and Social Psychology, 37, 1993–2002.
Sulsky, L. M., & Balzer, W. K. (1988). Meaning and measurement of performance rating accuracy: Some methodological and theoretical concerns. Journal of Applied Psychology, 73, 497–506.
Sulsky, L. M., & Day, D. V. (1992). Frame-of-reference training and cognitive categorization: An empirical investigation of rater memory issues. Journal of Applied Psychology, 77, 501–510.
Sulsky, L. M., & Day, D. V. (1994). Effects of frame-of-reference training on rater accuracy under alternative time delays. Journal of Applied Psychology, 79, 535–543.
Sypher, B. D., & Sypher, H. E. (1983). Perceptions of communication ability: Self-monitoring in an organizational setting. Personality and Social Psychology Bulletin, 9, 297–304.
Task Force on Assessment Center Guidelines. (1989). Guidelines and ethical considerations for assessment center operations. Public Personnel Management, 18, 457–470.
Task Force on Assessment Center Standards. (1980). Standards and ethical considerations for assessment center operations. Personnel Administrator, 25, 35–38.
Thomson, H. A. (1970). Comparison of predictor and criterion judgments of managerial performance using the multitrait–multimethod approach. Journal of Applied Psychology, 54, 496–502.
Turnage, J. J., & Muchinsky, P. M. (1982). Transsituational variability in human performance within assessment centers. Organizational Behavior and Human Performance, 30, 174–200.
Werner, J. M., & Bolino, M. C. (1997). Explaining U.S. Courts of Appeals decisions involving performance appraisal: Accuracy, fairness, and validation. Personnel Psychology, 50, 1–24.
Wernimont, P. F., & Campbell, J. P. (1968). Signs, samples, and criteria. Journal of Applied Psychology, 52, 372–376.
Woehr, D. J. (1994). Understanding frame-of-reference training: The impact of training on the recall of performance information. Journal of Applied Psychology, 79, 525–534.
Woehr, D. J., & Huffcutt, A. I. (1994). Rater training for performance appraisal: A quantitative review. Journal of Occupational and Organizational Psychology, 67, 189–205.

Received July 24, 2000
Revision received November 20, 2001
Accepted November 20, 2001