PERSONNEL PSYCHOLOGY
1979, 32, 299-311

THE DEVELOPMENT OF BEHAVIORAL OBSERVATION SCALES FOR APPRAISING THE PERFORMANCE OF FOREMEN

GARY P. LATHAM, CHARLES H. FAY and LISE M. SAARI
University of Washington

Behavioral observation scales (BOS) were developed for first-line foremen. BOS are similar to behavioral expectation scales (BES) in that both are based on a job analysis procedure known as the critical incident technique. However, BOS differ from BES in that, in developing BOS, (a) a group of individuals is observed and rated on a five-point scale as to the frequency with which they engage in the behavior described by each incident/statement, (b) a total score for each individual is determined by summing the observer's responses for each behavioral item, and (c) an item analysis (or factor analysis, depending upon the sample size) is conducted to select the most discriminating items. Those items with the highest correlations with the total score on a scale are retained to form one behavioral criterion or scale (BOS).

The appraisal of managerial performance is a key process for an organization trying to increase or maintain its effectiveness. This is because the actions managers take affect the use of the organization's capital, technological, and human resources. Despite the obvious importance of making accurate appraisals of managerial effectiveness, most performance appraisal instruments, if they exist at all within an organization, consist of a list of "key traits" (e.g., "is a team player," "is conscientious," "shows initiative") or cost-related variables (e.g., "sold 10,000 widgets"). The limitations of evaluating managers on the basis of traits or cost-related variables have been discussed elsewhere (Campbell, Dunnette, Lawler, and Weick, 1970; Latham and Mitchell, 1976; Latham and Wexley, 1977). In brief, the use of traits (e.g., "ambitious," "aggressive," "task oriented") often causes confusion and misunderstanding because the supervisor and the subordinate frequently define such terms differently. Cost-related variables are particularly troublesome for defining, developing, and/or maintaining "managerial excellence." This is because they implicitly encourage a results-at-all-costs mentality that may lead to conflict with corporate ethics policies. In addition, such variables frequently contain elements beyond the control of any one individual, they frequently leave out factors for which a given individual should be held accountable, and, most importantly, they do not tell the individual what it is that he is actually doing that is effective or ineffective in performing his job. It may be easy to determine whether an employee is or is not meeting a set of objectives, but the answer(s) to the question(s) of how or why can remain elusive. For these reasons, it is paramount for organizations interested in developing their human resources to define employee excellence in explicit behavioral terms, that is, in terms of the observable things that employees do that enable them to be successful. The immediate advantages of such an approach are that, in addition to overcoming some of the problems inherent in trait and cost-related measures, this approach (a) lends itself to training programs by specifying the content of the training in terms of the skills and knowledge in which employees are deficient, (b) permits comprehensive job descriptions, and (c) facilitates decisions regarding manpower planning and staffing.

The procedure most frequently used in the past 15 years for developing behavioral criteria is the behavioral expectation scale (BES) developed by Smith and Kendall (1963). This method is based on a job analysis procedure known as the critical incident technique (CIT).

The authors are grateful to C. H. Bell, F. E. Fiedler, and T. R. Mitchell for their constructive comments in preparing this manuscript. Requests for reprints should be sent to Gary Latham, College of Business Administration, DJ-10, University of Washington, Seattle, Washington 98195. Copyright © 1979 by PERSONNEL PSYCHOLOGY, INC.
The CIT (Flanagan, 1954) requires observers who are aware of the aims and objectives of a given job, and who see people perform the job on a frequent basis (e.g., daily), to describe incidents of job performance that they have observed over the past 6 to 12 months. With regard to each incident, the observer is requested to specify (a) the circumstances, background, or context, (b) exactly what the person did that was effective or ineffective, and (c) how the incident was an example of effective or ineffective behavior. Effectiveness is defined as behavior which the observer wished he could see on the part of all job incumbents in similar situations. Ineffectiveness is defined as behavior which, if it occurs repeatedly, or even once under certain circumstances, would make the observer doubt the competence of that individual. After all the incidents have been collected, a group of job incumbents categorizes the incidents into overall job categories (e.g., job knowledge, motivation, interactions with subordinates). Each category serves as one criterion for evaluating an employee.
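Each incident thus carries three parts plus an effectiveness judgment. As a record structure, it might look like the sketch below (the field names and the sample incident are ours, for illustration only):

```python
from dataclasses import dataclass

@dataclass
class CriticalIncident:
    """One observed incident, per the three-part CIT format (Flanagan, 1954)."""
    context: str     # (a) circumstances, background, or context
    behavior: str    # (b) exactly what the person did
    rationale: str   # (c) why this exemplifies effective/ineffective behavior
    effective: bool  # the observer's overall judgment

incident = CriticalIncident(
    context="Crew reported a frayed hoist cable at the start of shift",
    behavior="The foreman shut the hoist down and reassigned the crew",
    rationale="Acting immediately on an unsafe condition prevented injury",
    effective=True,
)
```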

GARY P. LATHAM ET AL.

301

A third group of job incumbents is given the list of critical incidents and the job criteria (i.e., the categories developed by a previous group of supervisors). These individuals are asked to individually allocate each incident to the one criterion that they believe the critical incident illustrates. Those incidents which are not assigned to the same dimension by more than a certain percentage (e.g., 80%) of the judges are eliminated. In this way, ambiguous incidents are eliminated, and independent (non-overlapping) performance criteria are believed to be determined. The next step is to give still another group of individuals, who are familiar with the job, a booklet containing the performance criteria categories and the list of incidents which the previous judges agreed defined each criterion. This group of judges is then asked to rate each incident, usually on a 7-point scale, as representing good, average, or poor performance for the job of interest. Only those items for which there is a high degree of interjudge agreement are retained. The numerical value given to each of these items is the mean of all the judges’ ratings. These items are used as anchors or benchmarks on the rating scale, hence the frequently used term, behaviorally anchored rating scales (BARS), a term synonymous with BES. The term BES is derived from the fact that the items used as anchors are reworded from actual behaviors (e.g., works overtime) to expected behaviors (e.g., could be expected to work overtime). In observing and appraising an employee, a supervisor must decide if the behaviors he has seen would lead him to expect any of the behaviors shown along the scale. In essence, the anchors are simply illustrations or aids to assist the supervisor in defining an employee’s behavior as superior, average, or below average during a given employment period. There are at least two important advantages of BES. 
As pointed out by Wexley and Yukl (1977), the anchors are behavioral in nature and are expressed in the raters' own terminology. This eliminates much of the ambiguity found in rating scales based on key traits. In addition, these scales lend themselves to employee development by giving the employee behavioral feedback. Unfortunately, the BES has several limitations. As Schwab, Heneman, and DeCotiis (1975) have noted, a substantial number of critical incidents generated in step 1 are discarded in the subsequent steps: ". . . if one assumes that the original pool of incidents generated in any BARS study all represent behaviors that an evaluator may see and assess in an applied setting, instruments defined and anchored by relatively few examples could create at least two problems. First, the evaluator may have difficulty assigning observed behaviors to specific dimensions. Second, the evaluator may have difficulty deciding the value of effectiveness of the observed behavior against the examples provided. Both of these problems would obviously be potential sources of error variance" (Schwab et al., 1975, p. 558).

302

PERSONNEL PSYCHOLOGY

A second problem, cited by these same authors, is that the subjective process used in developing the individual appraisal criteria may result in criterion categories that are not independent. A procedure that overcomes each of these limitations, but retains the advantages of BES/BARS, is the development of behavioral observation scales, or BOS (Latham and Wexley, 1977). The primary difference between the two procedures is in essence the same as that which differentiates the Thurstone (1929) and Likert (1932) approaches to the development of attitude scales. The development of the BES is similar to the Thurstone approach in that judges are given the incidents obtained from the job analysis to rate numerically in terms of the extent to which they represent effective job behavior. The BOS is similar to the Likert method in that (1) a large number of incidents/statements related to the object in question are collected; (2) a group of individuals are observed and rated on a five-point scale as to the frequency with which they engage in the behavior described by each incident/statement; (3) a total score for each individual is determined by summing the observer's responses to all the behavioral items; and (4) an item analysis (or factor analysis, depending upon the size of the sample) is conducted to select the most discriminating items. Those items with the highest correlations with the total score on the scale are retained to form a behavioral criterion. It is the use of item analysis to select items that most clearly distinguishes the Likert/BOS method from the Thurstone/BES method. The purpose of this research was to show how BOS can be developed for first-line foremen. Although the results may be specific to the management population found in the client company that sponsored this research, the procedures used to develop the criteria are transferable to other organizational settings.
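The Thurstone/BES versus Likert/BOS contrast can be sketched computationally. In the sketch below, the agreement cutoff, the item-total correlation cutoff, and the toy data are all illustrative choices of ours, not values from the study:

```python
from statistics import mean, pstdev

def thurstone_anchors(judge_ratings, max_sd=1.5):
    """BES/Thurstone style: keep incidents the judges rate consistently;
    each surviving anchor's scale value is the mean judge rating."""
    return {incident: mean(r) for incident, r in judge_ratings.items()
            if pstdev(r) <= max_sd}

def pearson(x, y):
    """Pearson correlation; 0.0 for a constant (non-differentiating) item."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def likert_item_analysis(ratings, min_r=0.3):
    """BOS/Likert style: ratings[i][j] is person i's 1-5 frequency rating on
    item j; keep the items that correlate with the total score."""
    totals = [sum(row) for row in ratings]
    return [j for j in range(len(ratings[0]))
            if pearson([row[j] for row in ratings], totals) >= min_r]
```

In the Thurstone route the judges' mean rating becomes the anchor value; in the Likert route the items themselves are selected empirically by their relation to the summed score, which is the distinction the paragraph above emphasizes.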
Method

Sample and Procedure

Superintendents (N = 20), foremen (N = 20), and hourly employees (N = 20) were interviewed in accordance with the critical incident technique (CIT). Each interviewee was asked to describe five effective and five ineffective incidents of foreman behavior. Superintendents, foremen, and hourly employees were interviewed because they are aware of the aims and objectives of the foreman's job and they observe foremen on the job daily; thus they are considered knowledgeable of the foreman's job. By interviewing samples from these three populations, a relatively comprehensive description of the foreman's job was ensured.


The superintendents, foremen, and hourly employees were interviewed individually. Each incident was tape recorded and transcribed in full in order to ensure accurate recording. The name of the individual described in an incident was not requested. Foremen were asked to cite incidents describing other foremen, because incidents describing one's self can be subject to bias, particularly with regard to reporting ineffective incidents (Vroom and Maier, 1961). A maximum of 10 incidents was collected from each interviewee to prevent any one individual from biasing the data; moreover, this number of incidents could be collected within 60 minutes, and a longer time period away from the job would have inconvenienced the interviewees. The above procedure is identical to that used in developing BES.

Results

Criterion development. The superintendents reported 189 incidents, the foremen reported 159, and the hourly employees reported 166, for a total of 514 incidents. Critical incidents that were similar if not identical in content were grouped together to form one behavioral item. For example, two or more incidents concerning a foreman who compliments or rewards his employees for doing a good job were used as the basis for writing the item, "praises and/or rewards subordinates for specific things they do well." Behavioral items that were similar were grouped together to form a specific criterion for performance appraisal. For example, the above behavioral item was grouped together with similar items (e.g., counsels employees on personal problems) to form the criterion "Interactions with subordinates." In this particular study the incidents were categorized by one of the authors. In most BES studies the incidents would have been categorized by superintendents of the foremen or by a cross-section of superintendents, foremen, and hourly personnel.
The advantage of a researcher categorizing the data is that it takes less time than training job incumbents how to (1) write behavioral (observable) items and (2) cluster them into meaningful categories. The advantage of the job incumbents rather than the researchers categorizing the incidents, although appealing intuitively, has yet to be empirically justified. (The procedure used here is inductive. A deductive procedure requires a group of observers to first agree on a list of the dimensions or overall criteria (e.g., Organizational Commitment) and then generate critical incidents, i.e., specific behaviors, defining each criterion. To our knowledge no research has systematically compared the advantages and disadvantages of the inductive versus the deductive procedure.)

Interjudge agreement. Interjudge agreement is concerned with the issue of whether another individual or group of individuals would have developed the same behavioral criteria based on the same critical incidents. This step is similar to the reallocation step (Smith and Kendall, 1963) followed by BES researchers. In the present study the critical incidents were placed in random order and given to a second author, who reclassified the incidents according to the established categorization system. The ratio of interjudge agreement was calculated as the number of incidents that both judges agreed should be placed in a given category (intersection) divided by the total number of incidents that either judge placed in that category (union). Thus, if one judge classified incidents 4, 7, 8, and 9 under the criterion "Interaction with Subordinates," and a second judge classified incidents 7, 8, 9, and 17 under the same criterion, the interjudge agreement would be .60 (7, 8, 9 / 4, 7, 8, 9, 17). An a priori decision was made that the ratio must be .80 or higher for a behavioral criterion to be considered acceptable. If the ratio was below .80, the behavioral items under the various criteria were reexamined to see if they should be classified under a different criterion and/or rewritten in terms of their specificity. The interjudge agreement for the eight criteria ranged from .86 to 1.00; thus the categorization system was considered satisfactory.

Content validity. Relevance (Nagle, 1953) is to criterion development as content validity is to test construction. Relevance or content validity is concerned with the systematic evaluation of an appraisal instrument to see if it includes a representative sample of the behavioral domain of interest (Anastasi, 1976). Two tests for content validity were applied to the data. Prior to the categorization of critical incidents, 10% of the incidents were set aside. After the categorization was completed, these incidents were examined to see if any of them described behaviors that had not yet appeared.
If this examination had necessitated the development of a new behavioral criterion or the formation of two or more behavioral items under an existing criterion, the hypothesis that a sufficient number of incidents had been collected would have been rejected. The results indicated that all the incidents that had been set aside could be classified under the existing behavioral items. The second test of content validity involved recording the increase in the number of behavioral items with the increase in the number of incidents classified. If 80% of the items appeared when 75% of the incidents had been categorized, the content validity of the BOS was considered satisfactory. The results are shown in Table 1. When 75% of the incidents had been categorized, 91% of the behavioral items had emerged. The hypothesis that the appraisal instrument was relevant (content valid) was accepted.

Instrument construction. The appraisal instrument was developed by
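Both checks, the intersection/union agreement ratio and the test that new items stop appearing as categorization proceeds, reduce to a few lines. This is a sketch; apart from the paper's own 4, 7, 8, 9 example, the data are toy values:

```python
def interjudge_agreement(judge_a, judge_b):
    """Incidents both judges put in a category (intersection) over
    incidents either judge put there (union) -- a Jaccard ratio."""
    a, b = set(judge_a), set(judge_b)
    return len(a & b) / len(a | b)

def item_saturation(item_per_incident):
    """For each incident categorized (in order), the fraction of all
    distinct behavioral items that have appeared so far."""
    seen, curve, n = set(), [], len(set(item_per_incident))
    for item in item_per_incident:
        seen.add(item)
        curve.append(len(seen) / n)
    return curve

# The paper's own example: {4,7,8,9} vs {7,8,9,17} -> 3/5 = .60
print(interjudge_agreement({4, 7, 8, 9}, {7, 8, 9, 17}))  # 0.6
```

Under the paper's rule, a scale whose agreement ratio falls below .80 would be reexamined, and the instrument is judged content valid if the saturation curve reaches at least 80% of items by the time 75% of incidents are categorized.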


TABLE 1
Cumulative Percentage of Incidents and Behaviors

Incidents    Behavioral Items
  25%              58%
  50%              79%
  75%              91%

attaching a 5-point Likert-type scale to each behavioral item. Superintendents were asked to indicate the frequency with which they had observed each of their foremen engage in each behavior. An example of one behavioral item is shown below.

Tells crew to inform him immediately of any unsafe condition
Almost Never   1   2   3   4   5   Almost Always

Ninety-two behavioral items were grouped into eight criteria or behavioral observation scales, as shown below:

I     Interaction with Subordinates   35 items
II    Interaction with Peers           4 items
III   Interaction with Supervisors     3 items
IV    Safety                          10 items
V     Technical Competence             8 items
VI    Work Habits                     25 items
VII   Planning Ahead                   4 items
VIII  Record Keeping                   3 items
      TOTAL                           92 items
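A BOS score, as the abstract describes, is simply the item frequency ratings summed within each scale. A minimal sketch using the eight scale sizes above (the all-fives ratee is a made-up illustration):

```python
def score_scales(item_ratings, scale_sizes):
    """item_ratings: one observer's 1-5 ratings for all items, in scale order.
    scale_sizes: how many items belong to each scale, in the same order.
    Returns the per-scale totals."""
    totals, start = [], 0
    for size in scale_sizes:
        totals.append(sum(item_ratings[start:start + size]))
        start += size
    return totals

sizes = [35, 4, 3, 10, 8, 25, 4, 3]   # the eight scales, 92 items in all
ratings = [5] * 92                    # a foreman rated 5 ("Almost Always") throughout
print(score_scales(ratings, sizes))   # [175, 20, 15, 50, 40, 125, 20, 15]
```

The all-fives case reproduces the maximum possible scores per scale, which is a quick consistency check on the scale sizes.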

Prior to filling out the appraisal instrument, the superintendents to whom the foremen reported received a six-hour training program to minimize rating errors such as the halo effect, positive and negative leniency, central tendency, first impressions, and contrast effects. The effectiveness of this training has been described elsewhere (Latham, Wexley, and Pursell, 1975). The superintendents (N = 16) then rated those foremen (N = 90) who reported to them as to the frequency with which they had observed each behavior. A foreman received a 1 if he had been observed engaging in a behavior 0-64% of the time when such behavior was appropriate during the past six months, a 2 for 65-74%, a 3 for 75-84%, a 4 for 85-94%, and a 5 for 95-100%. In some cases the items are stated in terms of ineffective behavior because that is the way the incidents were described by the interviewees during the job analysis. The points within each of the behavioral observation scales were summed and a total rating was obtained across the eight scales. The range of possible scores is shown in Table 2 as well as the actual means


TABLE 2
BOS Scores Earned by Foremen

                                      Potential Scores
Criteria                               Low     High       Mean      S.D.
I     Interaction with Subordinates     35      175      129.95     18.01
II    Interaction with Peers             4       20       15.71      2.63
III   Interaction with Supervisors       3       15       12.17      1.91
IV    Safety                            10       50       30.80      5.73
V     Technical Competence               8       40       30.14      4.95
VI    Work Habits                       25      125      100.05     11.20
VII   Planning Ahead                     4       20       14.67      2.59
VIII  Record Keeping                     3       15        8.39      2.54
      TOTAL                             92      460      349.88     39.15

and standard deviations of scores based on the observations of the 90 foremen. These scores can be used for making job-related decisions.

Instrument refinement. Many items on the BOS, although critical in terms of defining highly effective or ineffective performance, are observed either so frequently or so infrequently that they do not differentiate the good from the poor foremen. For example, of the 90 foremen rated on "Has the smell of liquor on his breath," 85 received a 5 (almost never), 4 received a 4 (seldom), and 1 person received a 3 (sometimes). A major purpose of a performance appraisal instrument is to differentiate between good and poor performers. The above item does not meet this requirement since almost everyone received the same rating. Thus, the decision was made to eliminate items (N = 32) with a median rating of less than 3.0 or greater than 4.0. The reliability (internal consistency) of the remaining 60 items was determined. The initial results are shown in Table 3. Low alpha coefficients were obtained for scales with a small number of items.

TABLE 3
Internal Consistency and Average Scores of Revised BOS

Criteria                              Number of Items   Mean Score   Alpha
I     Interaction with Subordinates         26             93.66      .92
II    Interaction with Peers                 2              7.66      .43
III   Interaction with Supervisors           2              7.97      .70
IV    Safety                                 8             29.93      .84
V     Technical Competence                   3             10.85      .42
VI    Work Habits                           13             47.29      .84
VII   Planning Ahead                         4             14.67      .68
VIII  Record Keeping                         2              6.31      .43
      TOTAL                                 60            218.33      .96
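The refinement rule, eliminating items whose median rating across the 90 foremen falls below 3.0 or above 4.0, can be written directly. The item texts below echo the paper's examples, but the rating vectors are illustrative:

```python
from statistics import median

def drop_nondiscriminating(item_ratings, low=3.0, high=4.0):
    """item_ratings: {item text: list of ratings across all ratees}.
    Keep only items whose median rating lies within [low, high]."""
    return {item: r for item, r in item_ratings.items()
            if low <= median(r) <= high}

items = {
    "Praises subordinates for specific things they do well": [3, 4, 3, 4, 3],
    "Has the smell of liquor on his breath": [5] * 85 + [4] * 4 + [3],
}
kept = drop_nondiscriminating(items)
# The liquor item (median 5) is eliminated; the praise item (median 3) survives
```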


An iterative procedure was carried out, shifting items with low item-total correlations within scales to other scales, dropping those which seemed to fit nowhere, and combining other items to form a new scale or criterion. The results are shown in Table 4. The number of criteria was reduced from eight to four.

TABLE 4
BOS Results after Item Analysis

                                                                  Possible Score
Scale Category                    Number of Items  Mean Score   S.D.   Low   High   Alpha
I    Interaction with Subordinates      21           75.38     12.65    21    105    .92
II   Safety                              8           29.08      5.21     8     40    .85
III  Work Habits                        14           51.26      7.74    14     70    .85
IV   Organizational Commitment          11           40.42      6.02    11     55    .80
     TOTAL                              54          196.13     21.10    54    270    .95

When the frequencies of foremen falling in five categories on the original and revised instruments are compared, the effect of removing non-differentiating items can be seen. These categories are calculated by dividing the total possible range of scores into five equal groups; the names are derived from past company practice.

Number of foremen in each category (with percent of total in parentheses):

Category              Original BOS Scales   Revised BOS Scales
(1) Below Adequate           —                     —
(2) Adequate                 —                  3 (3.3%)
(3) Full                 15 (16.7%)            24 (26.7%)
(4) Excellent            59 (65.5%)            50 (55.6%)
(5) Superior             16 (17.8%)            13 (14.4%)

TABLE 5
Intercorrelations among the Four BOS

                Criterion I   Criterion II   Criterion III
Criterion II        .57
Criterion III       .74           .65
Criterion IV        .59           .69            .70
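The internal-consistency figures in Tables 3 and 4 are alpha coefficients. A standard Cronbach's alpha computation, offered here as a sketch rather than the authors' own procedure, is:

```python
def cronbach_alpha(ratings):
    """ratings[i][j] = ratee i's score on item j of one scale."""
    k = len(ratings[0])  # number of items in the scale

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(var([row[j] for row in ratings]) for j in range(k))
    total_var = var([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Two perfectly parallel items -> alpha = 1.0
print(cronbach_alpha([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]))  # 1.0
```

The formula also explains the pattern noted above: with k small (e.g., the 2-item scales in Table 3), the k/(k-1) correction cannot compensate for the small number of items, so low alphas are expected.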


Grade point analogy. The intercorrelations among the revised four BOS indicate that different aspects of the foremen's performance were being measured by each of the four criterion scales (see Table 5). Variation on any one criterion scale accounted for no more than roughly half the variation in another criterion. Since each criterion scale was comprised of a different number of items (21, 8, 14, and 11, respectively), the question of weighting the scales was considered. The grade point average (GPA) analogy was adopted. In most universities, college students receive a grade ranging from 0 to 4.0 for each course they take. A grade point average (overall performance rating) is computed by averaging across all courses regardless of the number of exams (items) used in each course (job criterion). That is, given that all courses are equal in credits, each course grade is weighted equally. Raw scores were used in the present study to compute the GPA for each foreman. The mean GPA resulting from this equal weighting was 2.69 (SD = .64). Giving each criterion equal weight is compatible with research in selection (Lawshe, 1959; Trattner, 1963) which has shown that sophisticated weighting of predictors (e.g., using multiple regression) seldom yields higher validities than simply adding up the individual predictor scores. Moreover, it allows the supervisor to use his "expert judgment" to take into account prevailing conditions (e.g., general economy, company's competitive position, distribution of skills within the company) at the time that a decision based on a composite evaluation (e.g., promote, layoff) is required (Guion, 1961).

Discussion
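The grade point computation described above can be sketched as follows. The paper does not spell out the exact raw-score transformation, so the 0-4.0 rescaling below is one plausible reading of ours; applied to Table 4's scale means it yields a figure near, but not identical to, the reported 2.69:

```python
def gpa(scale_scores, scale_items, low_pt=1, high_pt=5):
    """Rescale each scale's raw score onto 0-4.0 and average, so every
    criterion counts equally regardless of how many items it contains."""
    grades = []
    for score, k in zip(scale_scores, scale_items):
        lo, hi = k * low_pt, k * high_pt  # possible raw range for a k-item scale
        grades.append(4.0 * (score - lo) / (hi - lo))
    return sum(grades) / len(grades)

# Table 4's four scales (21, 8, 14, 11 items) evaluated at their mean raw scores
print(round(gpa([75.38, 29.08, 51.26, 40.42], [21, 8, 14, 11]), 2))  # 2.64
```

The point of the analogy survives any reasonable rescaling: the 21-item and 8-item scales contribute equally to the composite, just as a 1-exam course and a 5-exam course carry equal weight in a student's GPA.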

The advantages of using behavioral observation scales for conducting performance appraisals include the following:

1. BOS are developed from data supplied by the users for the users. Thus, understanding of and commitment to the use of the scales are facilitated. The frequently heard complaints by supervisors and subordinates alike, that the items on the appraisal instrument are either sufficiently vague to defy understanding or completely inappropriate for the individual's appraisal, are minimized.

2. The BOS are content valid. All the behaviors differentiating the successful from the unsuccessful performer are included on the scales. The appraiser is forced to make a thorough evaluation of an employee rather than emphasizing only those points that he can recall at the time of the appraisal.

3. The BOS can either serve alone or as a supplement to existing job descriptions in that they make explicit what behaviors are required of an employee in a given job. As a job description, the BOS can also be used as a "job preview" for potential job candidates by showing them


what it is they will be expected to do. Job previews are an effective means of reducing employee turnover and job dissatisfaction (Wanous, 1973). They assist the candidate in making a decision as to whether he would want to consistently demonstrate the behaviors described by the BOS.

4. The BOS can facilitate explicit performance feedback in that they encourage meaningful discussions between the supervisor and the employee of the latter's strengths and weaknesses. Generalities are avoided in favor of specific overt behaviors for which the employee is praised or which he is encouraged to demonstrate on the job. Explicit performance feedback combined with the setting of specific goals has been shown repeatedly to be an effective motivator for bringing about and/or maintaining a positive behavior change (Latham and Yukl, 1975; Latham, Mitchell, and Dossett, 1978).

5. The BOS satisfy EEOC Guidelines in terms of validity (relevance) and reliability. In the present study the content validity, the interjudge agreement of the categorization system, and the internal consistency of the scales themselves were found satisfactory. In previous studies (Latham and Wexley, 1977; Latham, Wexley, and Rand, 1975; Ronan and Latham, 1974) the test-retest and interobserver reliability, as well as the concurrent validity of the BOS with employee attendance and other cost-related measures (e.g., productivity), were demonstrated. Criterion bias is avoided in that, unlike with the BES, supervisors do not have to extrapolate from (a) what they have observed to (b) the placement of a checkmark beside an example that may or may not be appropriate.

Empirical comparisons between BES and BOS have yet to be made. However, a rational comparison suggests that the use of BOS avoids the following problems with BES summarized by Atkin and Conlon (1978).

1.
Explicit endorsement of an incident above the neutral point of BES implies endorsement of all other incidents between the incident in question and the neutral point. This endorsement, which may be unwarranted, is avoided with BOS because the rater is allowed to evaluate an individual on each and every item. A central tendency and a range can be determined for the individual on the criterion in question.

2. The criterion of "critical" is minimized in the generation of the behavioral items for BOS. Rather, emphasis is placed on developing an inventory of behaviors, rating an individual on the frequency with which each behavior is emitted, and conducting an item analysis against an internal or external criterion for determining the items that should compose the final rating instrument.

3. Related to the above criticism of BES is the point that standard


or non-critical behavior may not be processed and stored in the same way as non-standard behavior. Hence, at the time of the rating, raters may not have enough information about the performance of standard behaviors to use them in the BES context. The BOS, however, serve as a checklist for both the rater and the ratee to take into account in their respective day-to-day job functions. That is, the manager knows what s/he should be "alert to" in observing an employee, and the employee knows explicitly what the boss "is looking for."

4. Consistent with problems surrounding the use of judges to develop Thurstone scales, Atkin and Conlon suggest that to the degree to which a particular supervisor believed that a particular dimension is substantially more important than others, s/he would tend to define a relatively narrow range of acceptable behaviors, a relatively broad set of unacceptable behaviors, and a virtually null set of neutral behaviors. All the rater is required to do on BOS is to indicate the frequency with which s/he has observed the behavior. The behaviors that s/he is to observe are listed on the scale.

Essentially, the choice of BOS versus BES can be reduced to a preference for Likert versus Thurstone scales. Empirical comparison of these two scales in the area of attitude measurement has been fairly conclusive in showing the superiority of the Likert scale in terms of reliability. It is unlikely that a different conclusion will be reached in the area of performance appraisal.

REFERENCES

Anastasi, A. Psychological testing. New York: Macmillan, 1976.
Atkin, R. S. and Conlon, E. J. Behaviorally anchored rating scales: Some theoretical issues. Academy of Management Review, 1978, 3, 119-128.
Campbell, J. P., Dunnette, M. D., Lawler, E. E. III, and Weick, K. E. Managerial behavior, performance, and effectiveness. New York: McGraw-Hill, 1970.
Flanagan, J. C. The critical incident technique. Psychological Bulletin, 1954, 51, 327-358.
Guion, R. M. Criterion measurement and personnel judgments. PERSONNEL PSYCHOLOGY, 1961, 14, 141-149.
Latham, G. P. and Mitchell, T. R. Behavioral criteria and potential reinforcers for the engineer/scientist in an industrial setting. JSAS Catalog of Selected Documents in Psychology, 1976, 6, 83 (Ms. No. 1316).
Latham, G. P., Mitchell, T. R., and Dossett, D. L. The importance of participative goal setting and anticipated rewards on goal difficulty and job performance. Journal of Applied Psychology, 1978, 63, 163-171.
Latham, G. P. and Wexley, K. N. Behavioral observation scales. PERSONNEL PSYCHOLOGY, 1977, 30, 255-268.
Latham, G. P., Wexley, K. N., and Pursell, E. D. Training managers to minimize rating errors in the observation of behavior. Journal of Applied Psychology, 1975, 60, 550-555.
Latham, G. P., Wexley, K. N., and Rand, T. M. The relevance of behavioral criteria developed from the critical incident technique. Canadian Journal of Behavioural Science, 1975, 7, 349-358.


Latham, G. P. and Yukl, G. A. A review of research in the application of goal setting in organizations. Academy of Management Journal, 1975, 18, 824-845.
Lawshe, C. H. Statistical theory and practice in applied psychology. PERSONNEL PSYCHOLOGY, 1959, 22, 117-124.
Likert, R. A technique for the measurement of attitudes. Archives of Psychology, 1932, No. 140.
Nagle, G. F. Criterion development. PERSONNEL PSYCHOLOGY, 1953, 6, 271-289.
Ronan, W. W. and Latham, G. P. The reliability and validity of the critical incident technique: A closer look. Studies in Personnel Psychology, 1974, 6, 53-64.
Schwab, D. P., Heneman, H., and DeCotiis, T. Behaviorally anchored rating scales: A review of the literature. PERSONNEL PSYCHOLOGY, 1975, 28, 549-562.
Smith, P. C. and Kendall, L. M. Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 1963, 47, 149-155.
Thurstone, L. L. Theory of attitude measurement. Psychological Bulletin, 1929, 36, 224-241.
Trattner, M. H. Comparison of three methods for assembling aptitude test batteries. PERSONNEL PSYCHOLOGY, 1963, 16, 221-232.
Vroom, V. H. and Maier, N. R. F. Industrial social psychology. Annual Review of Psychology, 1961, 12, 413-446.
Wanous, J. P. Effects of a realistic job preview on job acceptance, job attitudes, and job survival. Journal of Applied Psychology, 1973, 58, 327-332.
Wexley, K. N. and Yukl, G. A. Organizational behavior and personnel psychology. Homewood, Illinois: Irwin-Dorsey, 1977.
