PERSONNEL PSYCHOLOGY 1999, 52

EXPLORING THE BOUNDARY CONDITIONS FOR INTERVIEW VALIDITY: META-ANALYTIC VALIDITY FINDINGS FOR A NEW INTERVIEW TYPE

FRANK L. SCHMIDT, MARK RADER
Department of Management and Organizations
University of Iowa

This study uses meta-analysis of an extensive predictive validity database to explore the boundary conditions for the validity of the structured interview as presented by McDaniel, Whetzel, Schmidt, and Maurer (1994). The interview examined here differs from traditional structured interviews in being empirically constructed, administered by telephone, and scored later based on a taped transcript. Despite these and other differences, this nontraditional employment interview was found to have essentially the same level of criterion-related validity for supervisory ratings of job performance reported by McDaniel et al. for other structured employment interviews. These findings suggest that a variety of different approaches to the construction, administration, and scoring of structured employment interviews may lead to comparable levels of validity. We hypothesize that this result obtains because different types of structured interviews all measure, to varying degrees, constructs with known generalizable validity (e.g., conscientiousness and general mental ability). The interview examined here was also found to be a valid predictor of production records, sales volume, absenteeism, and job tenure.

The employment interview is second only to the application blank in frequency of use as a selection procedure (American Society for Personnel Administration, 1983; Ulrich & Trumbo, 1965). Recent research indicates that employment interviews, used properly, have higher validity for predicting job performance than previously believed and that validity is generalizable (McDaniel, Whetzel, Schmidt, & Maurer, 1994; see also Huffcutt & Arthur, 1994).

The authors thank the Gallup Organization for providing access to the validity data used in this research and for cooperation and assistance in other phases of the research. We particularly thank Donald Clifton, Chairman of the Board, and Richard Harding, Senior Vice President for Research, for their help and for their useful comments on an earlier version of this study. The authors also thank Michael McDaniel for providing detailed information on the McDaniel et al. (1994) meta-analysis beyond the material contained in the published article. Correspondence and requests for reprints should be addressed to Frank Schmidt, Department of Management and Organizations, College of Business, University of Iowa, Iowa City, IA 52242; [email protected].

COPYRIGHT © 1999 PERSONNEL PSYCHOLOGY, INC.


However, in any area of research that is not guided by strong and well-developed theory, it is important to investigate the boundaries of known empirical generalizations. In the criterion-related validity area, this requirement is particularly important for selection procedures such as the interview, evaluations of education and experience, biographical data measures, and assessment centers; that is, for procedures that have not been shown to measure a particular or well-defined construct, making it difficult or impossible to base validity extensions on theory and developed nomological networks of correlations among constructs in the cumulative research literature (Schmidt & Rothstein, 1994).

The employment interview is a measurement procedure, as are paper-and-pencil tests of ability or aptitude. However, meta-analyses of the validity of paper-and-pencil tests are conducted separately for each ability construct measured by such tests, in order to examine the validity associated with measures of particular constructs. Employment interviews may measure a mixture of constructs, such as interpersonal skills, cognitive ability, and work motivation (Hunter & Hunter, 1984), but it is not possible to conduct separate meta-analyses for each construct (or even particular combinations of constructs) because knowledge is lacking of which constructs are measured in which interviews. Therefore, for such predictors, conclusions from meta-analyses apply to general methods rather than to constructs (Schmidt & Rothstein, 1994). Hence the validity of interviews and similar procedures must, at least at present, be established by relying more heavily on an empirical generalization approach and less heavily on a broader network of theory and related empirical findings (i.e., established nomological nets of correlations among constructs). This increases the importance of empirically establishing the boundary conditions for validity for such selection methods (Schmidt & Rothstein, 1994).

The present study is an investigation of the boundary conditions of the validity of structured employment interviews. Based on 107 validity estimates, this study examines whether the validity findings of McDaniel et al. (1994), the most comprehensive meta-analysis of interview validity to date, extend to a nontraditional type of structured employment selection interview. Scores on this interview have not yet been correlated with personality or cognitive measures, so such information cannot be used to explicate its general construct validity; the same is generally true of traditional structured interviews (McDaniel et al., 1994). The interview examined in this paper differs from the traditional structured interviews studied in McDaniel et al. (1994) in several important respects; each of these differences is potentially useful in determining the boundary conditions for the validity conclusions reached in McDaniel et al. (1994). The first difference is the manner in which interview questions are developed and scored.


In traditional structured interviews, interview questions are developed based on a task-oriented job analysis. That is, the developer examines the tasks that make up the job and creates questions intended to reveal applicant standing on the knowledges, skills, abilities, and other traits (KSAOs) judged to be determinants of performance on the tasks making up the job. (Situational and behavior description interviews, however, are often based on critical incidents rather than a task-oriented job analysis.) The procedure for the scoring of applicant responses to these questions is also determined rationally, based on the judgment of the developer of the interview.

In the interview examined here, the questions and their scoring are developed empirically, in the following manner. First, the job description for the job in question is examined. Next, the researcher visits the organization and observes employees performing on the job. Then a group of outstanding performers nominated by the employer are interviewed in depth by researchers to define and clarify the major functions and responsibilities of the job and to identify behavioral tendencies or traits (called themes) that appear to characterize these top performers and to be related to high-level job performance. (If group members disagree on a theme, it is retained in the group of themes to be tested empirically in a later step.)

For example, a frequent theme is teamwork. An employee high in teamwork helps and cooperates with others, works to meet common goals, shares information with others, and so forth. Another example is the developer theme. An employee high on this theme coaches and teaches others, makes efforts to find out what people can do, and helps them develop and grow. A third example is goal focus. An employee high on this theme has clear goals, is self-directed, has a longer term view, and stays on target over time. (Research is currently under way on the factor structure of these themes and the relation of these factors to the Big Five personality traits and other personality dimensions.) Typically, about 10 to 15 themes are identified for any one job; themes differ across jobs.

Next, an initial set of potential interview questions is developed to measure each of the identified themes. For example, for the developer theme a typical question might be "Suppose you notice that one of your co-workers is making errors because he or she doesn't know the right sequence of work steps. What would you do?" For any one job, the total number of questions at this point across all themes is approximately 120. Typically about 20% of these questions are new and the remainder are questions known from past studies to be good measures of specific themes.

A reviewer inquired whether the questions used are situational (as is the one above) or whether behavior description questions are used. Both can be used, but many of the questions used do not fit into either of these categories.


For example, "How competitive are you?" or "How do you feel when someone doubts what you say?" As these two examples show, the range of questions can be broader than is found in typical situational and behavior description interviews. (All questions of this sort are followed up with questions asking the applicant for specific examples: "Tell me about a time when you were very competitive.")

Another reviewer inquired as to typical levels of intercorrelations among themes. These intercorrelations typically range from -.10 to +.50. For example, the themes empathy and relator have an average intercorrelation of .51. (Because individual theme measures have limited reliability, these correlations can be expected to be larger at the construct level.) An example of a high negative correlation is that between the uniformity theme (tendency to see all people as the same) and the individualized perception theme (tendency to see people as unique). This correlation averages -.50 (and is undoubtedly close to -1.00 at the true score, or construct, level).

Next, the employer is asked to nominate a group of current employees who are outstanding performers and a group who are less than satisfactory performers. (In some cases, the contrast group consists of "average" performers.) Typically there are about 30 in each group. These individuals are interviewed individually; their responses to the initial set of about 120 questions are recorded and transcribed. The questions (and themes) that are retained for the final interview are those that show the largest differences (d-values) between the high and low performing groups. Required d-values are usually about .50; these effect sizes, and not significance tests, are used to select questions. The difficulty of the question is also considered, with very difficult and very easy questions being dropped. However, if both the high performing and low performing groups score high on a theme indicated by the job analysis phase to be important for minimally acceptable performance, the possibility that the lack of difference is due to range restriction on basic performance requirements is considered, and a judgment may be made to retain such a question. Typically about 60 of the original 120 questions are retained; the content of these questions differs markedly from job to job. The retained questions typically represent about 10 to 12 themes, which also vary across jobs. Therefore it is not expected that any given interview would be valid for all jobs, but rather only for the job for which it is constructed. In this respect, the interview type does not differ from traditional structured interviews.
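To make the retention rule concrete, the following is a minimal sketch of selecting questions by effect size. The function names, data layout, and cutoff handling are illustrative only and do not reproduce the actual Gallup procedure, which also weighs question difficulty and job-analysis judgments.

```python
import statistics

def cohens_d(high_scores, low_scores):
    """Standardized mean difference (pooled SD) between the outstanding
    and contrast groups on a single candidate question."""
    n1, n2 = len(high_scores), len(low_scores)
    s1, s2 = statistics.stdev(high_scores), statistics.stdev(low_scores)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(high_scores) - statistics.mean(low_scores)) / pooled_sd

def retain_questions(item_scores, d_cutoff=0.50):
    """Keep the questions whose d-value meets the cutoff.
    item_scores maps a question id to (high_group, low_group) score lists."""
    return [q for q, (high, low) in item_scores.items()
            if cohens_d(high, low) >= d_cutoff]

# Hypothetical 0/1 item credits for two candidate questions.
items = {
    "q1": ([1, 1, 1, 0, 1], [0, 0, 1, 0, 0]),   # d is about 1.3 -> retained
    "q2": ([1, 0, 1, 0, 1], [1, 0, 1, 1, 0]),   # d = 0 -> dropped
}
print(retain_questions(items))   # ['q1']
```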


This empirically based process also establishes the scoring protocol for the questions. This protocol is not an objective scoring key but rather a guide to be used in judging whether any given answer should be credited as showing the presence of a given theme. For example, the scoring guide presents responses that exemplify the courage theme (willingness to express an opinion that might be unpopular or sanctioned by others). Scoring of any given response is positive and dichotomous. That is, if the response is judged to indicate the theme, the applicant is assigned a point; otherwise no credit is given. (There is no provision for negative points.) This procedure is similar to the empirical procedure used in the initial phase of construction of the behavioral consistency method of evaluating past achievements and experiences (McDaniel & Schmidt, 1988; Schmidt, Caplan, Bemis, Decuir, Dunn, & Antone, 1979). Additional information on scoring procedures can be found in Selection Research Standards (Gallup, 1996, Sec. B).

Applicants are administered this interview and then asked to evaluate the interview by responding to two questions: "How do you feel about answering these questions?" and "How well do you feel this interview gave you an opportunity to describe yourself?" On a 5-point scale (where 5 is the highest rating), interviewee ratings average between 4 and 5, despite the fact that no explanation is given for the kinds of questions asked or for why the interview is conducted by telephone. This rating by applicants does not vary systematically across job types. Total scores on this interview have been found to show no differences or only very small differences by race or sex.

Administration time for the typical interview is about 30 minutes. Scoring of the interview tape by a trained analyst requires about 1 hour (including the short written report that is provided to employers). When managers (rather than Gallup analysts) score the interviews, they are first each put through a 3-day training program, which includes comparing their scoring results with those of trained analysts to calibrate accuracy. These checks are run throughout the first year of use, and at 1-year intervals thereafter. The minimum requirement is 85% agreement with the trained analyst's scoring of the interview. To become a Senior Analyst, an analyst must score 500 interviews under the supervision of a Senior Analyst, with at least 85% agreement with the Senior Analyst's scoring.

For convenient use by the employer, the total score distribution is divided into A, B, and C groups, and employers are urged to hire only As (the highest group) if possible, and in any event to avoid hiring those in the C group. In addition, a short written evaluation of each candidate is prepared and provided to the selecting officials, who are given training in how to interpret the interview results, including the written report. Employers are told that the interview should be used along with other available job-related information about the applicant.
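As a concrete illustration of the dichotomous scoring and A/B/C banding just described, here is a minimal sketch; the cutoff values and function names are hypothetical and are not taken from the Gallup procedure, in which cutoffs are set from the employer's own score distribution.

```python
def total_score(theme_credits):
    """Total interview score under the dichotomous rule: one point per
    response judged to show its theme, no negative points."""
    return sum(theme_credits)

def band(total, a_cutoff, c_cutoff):
    """Illustrative A/B/C banding of the total score (hypothetical cutoffs)."""
    if total >= a_cutoff:
        return "A"   # employers are urged to hire from this group
    if total <= c_cutoff:
        return "C"   # employers are urged to avoid this group
    return "B"

# Hypothetical 60-question interview: 43 credited responses, cutoffs at 40 and 25.
print(band(total_score([1] * 43 + [0] * 17), a_cutoff=40, c_cutoff=25))   # "A"
```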


This is clearly a different approach to construction of the structured interview than is typically used. No empirically constructed interviews were included in McDaniel et al. (1994; McDaniel, personal communication, March 18, 1996). An important question concerning boundary conditions for these earlier findings is whether they generalize to empirically constructed interviews. On one hand, because of the way questions are selected and scored, one might expect empirically constructed interviews to have higher validity, because only questions found to correlate with job performance are retained in the final interview. On the other hand, the process of question selection is susceptible to capitalization on sampling errors in the original sample and therefore susceptible to validity shrinkage when used with a new and independent sample (i.e., actual applicants to the job). These two processes work in opposite directions and may cancel each other out.

The second major difference is that the interview studied here is administered by telephone. This process works as follows. At a time previously agreed upon, the applicant receives a telephone call from an interview administrator at the employing organization (or at the consulting organization, depending on the arrangement). The interview is administered by telephone and recorded for later scoring based on the scoring guide. The administrator reads each question carefully to the applicant, allows time for the answer, and makes comments such as "Would you like me to read the question again?" and "Is there anything else you would like to add?" to encourage full development of responses. There are no time limits. The administrator adheres to the exact wording of the questions; he or she may repeat questions but does not rephrase them or attempt to explain their meaning or intent. (The rationale here is to make the interview objective and standardized for each interviewee; i.e., to prevent the interviewer from injecting his or her own biases into the question.) The recorded responses are later scored independently by one or more different individuals who have been specially trained in the scoring process. Scoring is checked by examining agreement between independent scorers.

A reviewer inquired why this procedure could not be administered in written form, as opposed to an oral interview. There are two reasons. First, the interview attempts to obtain the interviewee's first reaction to each question, not the response that would be produced after reflection and thought. The applicant's initial reaction is hypothesized to be more valid, because the absence of reflection does not allow applicants to try to guess the "correct" answer. This spontaneous aspect would be lost with a written format. Second, for many jobs, interviewees lack the requisite writing skills.

The interviews examined in earlier research (McDaniel et al., 1994; Huffcutt & Arthur, 1994) have all been conducted face-to-face (McDaniel, personal communication, March 18, 1996).


Conducting employment interviews by telephone can produce cost savings because applicants in widely scattered geographical locations can be interviewed with no travel costs for interviewees or interviewers. For example, if a client has branches or stores in 50 or more U.S. cities, applicants for all locations can be interviewed from one central office. However, it is not clear what the effect of telephone administration, if any, on validity might be. On the one hand, subtle facial expressions of the interviewer can no longer be used by the applicant as cues suggesting the desired answer, perhaps improving the validity of the responses. The physical attractiveness of the applicant cannot influence scores or ratings. It is also possible that many people are less guarded and more open and honest when speaking on the telephone, in comparison to face-to-face interactions. On the other hand, it is possible that the intended meaning of the questions might be less clear, especially given that the administrator of the interview does not rephrase or otherwise elaborate on the intended meaning of the questions. So it is not clear what the expected impact on interview validity would be. However, it is clear that these procedures are different enough from those typically used in structured interviews to be useful in testing the boundary conditions of interview validity.¹

The next way that the present study differs from the previous meta-analyses that have been used to establish the validity of the employment interview is the fact that all of the validity estimates examined here are derived from the interview procedure described above. This fact leads to a straightforward (but secondary) prediction: The standard deviation of interview validities should be smaller here (holding criterion type constant) than was found to be the case in McDaniel et al. (1994). Although the interviews examined in the present study are not all identical (as explained earlier), they are all based on the procedure described above and are more similar to each other on many dimensions than the interviews analyzed by McDaniel et al. The cumulative impact of differences across studies (interviews) in the McDaniel et al. meta-analysis would be predicted to produce variability of validities across studies greater than would be expected in the present study (Schmidt & Rothstein, 1994).

The present meta-analysis examines criterion measures not included (or not included as separate categories) in previous meta-analyses of interview validity: production records, sales data, and absenteeism. These criterion measures also speak to the boundary conditions of interview validity, albeit only with respect to the present interview type. The present study is the first meta-analysis of interview validity for these three criterion measures.

¹ An important requirement in some jobs is effective face-to-face communication. In the case of such jobs, the interview described here is used as a first screen, with a face-to-face interview being administered later to those with the highest interview scores. This second interview does not assess the full range of themes, but focuses only on oral communication skills.


The traditional criterion of supervisory ratings of performance is also included in the present meta-analysis. Because this is the criterion measure predominantly used in previous meta-analyses, the major validity comparison for the findings of this study with those of previous meta-analyses will necessarily be in terms of supervisory ratings criteria. However, the present study also reports meta-analytic validity estimates for the criterion of job tenure, which was also examined in McDaniel et al. (1994). So a comparison of findings for this criterion can also be made, with limitations that are discussed later. (McDaniel et al., 1994, also included the criterion of performance in training programs, which was not available for this study.)

Method

The interview described above was developed by a large consulting company and has been used in a wide variety of client firms. It is described in detail in Gallup (undated) and Selection Research, Inc. (1992). As noted above, because of the empirical method of development, the validity of this interview in the validation (developmental) sample is inflated by capitalization on chance sampling error configurations in that sample. Unbiased estimates of the validity of the interview require estimating validity in a new, independent sample (cross-validation sample). Therefore only data obtained on new samples could be used in this meta-analysis. These data were obtained from follow-up predictive validity studies; that is, after the development of the interview was complete, it was used in selecting new applicants. After a period of time, the job performance of these new employees was measured and the correlation between interview scores at the time of hire and later job performance was determined. Although validity estimates of this (predictive) type are generally ideal for estimating the predictive (operational) validity of a selection instrument, they are affected by restriction in range. However, the downward bias introduced by range restriction was corrected for in the meta-analysis.

We were able to obtain 209 cross-validated predictive validity coefficients, all from unpublished studies. These studies were conducted in a wide variety of client firms (e.g., manufacturing firms, public schools, insurance companies, and financial services organizations). Jobs included skilled and semiskilled workers in manufacturing, managers, public school teachers, life insurance sales, computer sales, financial service sales, and professional sports team members. This body of data included all predictive validity studies conducted by researchers at Gallup up to that time. In many cases, several validity coefficients were computed on the same sample, and thus were not statistically independent.


TABLE 1
Criterion Types and Number of Studies for Each Criterion Type

                                        Number of coefficients
Performance ratings
  Sales jobs                                      12
  Teaching, interviewers                          12
  Production jobs                                  5
  Managerial jobs                                  3
  Social work                                      1
  Subtotal                                        33
Turnover (tenure) data                            21
Sales data
  Insurance sales                                  7
  Computer sales                                  18
  Financial services                               3
  Other sales                                     13
  Subtotal                                        41
Production records                                 5
Absenteeism                                        7
Grand total                                      107

Typically, this occurred because the same sample included multiple measures of job performance (e.g., several measures of sales or ratings of several dimensions of job performance). The ideal procedure in such cases is to compute the correlation of the predictor (the interview score) with the sum of the performance measures, yielding a more precise estimate of validity (Hunter & Schmidt, 1990, pp. 457-463). However, computing this correlation for studies that did not report it requires knowledge of the correlations among the performance measures; in this data set, this information was available only for one study. In other cases, this absence of data forced use of the average correlation within the sample (along with the average sample size). This procedure causes a downward bias in validity estimates (Hunter & Schmidt, 1990, pp. 451-454) but was unavoidable.
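To show why the composite is preferred over the average, here is a minimal sketch of the composite-correlation logic for a predictor and a unit-weighted sum of standardized criterion measures; the numbers in the example are hypothetical and are not taken from any study in this data set.

```python
import math

def composite_validity(r_xy, mean_r_yy):
    """Correlation of a predictor with the unit-weighted sum of k standardized
    criterion measures, given the predictor-criterion correlations (r_xy) and
    the mean intercorrelation among the criterion measures (mean_r_yy)."""
    k = len(r_xy)
    numerator = sum(r_xy)                                 # k times the mean validity
    denominator = math.sqrt(k + k * (k - 1) * mean_r_yy)  # SD of the unit-weighted sum
    return numerator / denominator

# Hypothetical example: three rating dimensions, each correlating .20 with the
# interview and intercorrelated .50 on average.
print(round(composite_validity([0.20, 0.20, 0.20], 0.50), 3))   # 0.245, above the .20 average
```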


In certain cases in which there were multiple criterion measures, it was apparent that one measure was more construct valid than the others; in such cases, that measure was used alone, and an average correlation was not computed. This occurred mostly in sales jobs; for such jobs, dollar volume of sales was considered by Gallup consultants to be a more construct valid measure of performance than other available measures, such as percentage of sales quota met. (Sales quotas were reported to have been based on subjective judgments; however, observed validities did not differ as a function of this distinction.) The final number of independent validity estimates was 107. Table 1 shows the distribution of these validity estimates across criterion measures and types of jobs.

The meta-analysis procedure used was the interactive procedure (Hunter & Schmidt, 1990, pp. 182-187) with several refinements that have recently been shown analytically and via computer simulation to increase accuracy (Hunter & Schmidt, 1994; Law, Schmidt, & Hunter, 1994a, 1994b; Schmidt et al., 1993). Although the differences are often small, this procedure has been shown by computer simulation to be generally more accurate than other available procedures for artifact distribution based meta-analysis (Law et al., 1994a, 1994b).

Artifact Distributions

Predictor reliability distribution. Estimates of the reliability of the interview available from the Gallup Organization consisted of correlations between scores produced by different scorers for the same administration of the interview; that is, estimates of scorer agreement (or conspect reliability). These were quite high (.80-.90) but are not appropriate measures of the reliability of the interview per se, as explained in McDaniel et al. (1994) and Schmidt and Hunter (1996). Estimates of the correlation between scorers across re-administrations of the interview to the same interviewees were not available. This meta-analysis employed the empirically derived reliability distribution from McDaniel et al. for the structured interview; it has a mean of .84 and a standard deviation of .15. The values in this distribution do not affect the estimate of mean true validity. This distribution functions only to estimate the contribution of variations in predictor unreliability to the observed variance of the validities. (It is known from previous studies that this particular contribution is quite small; Hunter & Schmidt, 1990.) No correction for attenuation due to predictor unreliability is made to individual validity coefficients or to the mean validity coefficient. Although the same general procedure (described above) was used to construct each interview, the resulting interviews varied in the number and type of questions, in the sample size they were derived on, and on other dimensions; these factors would be expected to create variations in interview reliability. Artifact distributions are shown in Table 2.

Range restriction. As noted earlier, the studies examined here are predictive validity studies; interview scores were used in hiring the subjects. The employers in question relied on the interview in selecting new employees, and hence considerable range restriction can be expected. However, the figures needed to calibrate degrees of range restriction were not maintained. McDaniel et al. (1994) empirically developed a distribution of range restriction for employment interviews from the studies they examined. That distribution was used in the present meta-analysis and is shown in Table 2.
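The two corrections applied to the validities (criterion unreliability and range restriction) can be sketched in simplified form as follows. This ignores the artifact-distribution and interactive refinements of the actual procedure, and the input values are illustrative only; in particular, the u value of .74 is an assumed figure, since the McDaniel et al. range-restriction values are not reproduced here.

```python
import math

def operational_validity(r_obs, ryy, u):
    """Simplified sketch: correct an observed validity for criterion
    unreliability and then for direct range restriction (Thorndike Case II).
    No correction for predictor unreliability is made, as in the study.

    r_obs : observed predictor-criterion correlation in the incumbent sample
    ryy   : criterion reliability in the restricted group
    u     : ratio of restricted to unrestricted predictor SD
    """
    r1 = r_obs / math.sqrt(ryy)            # disattenuate for criterion error
    big_u = 1.0 / u                        # unrestricted / restricted SD ratio
    return (r1 * big_u) / math.sqrt(1 - r1**2 + (r1 * big_u)**2)

# Illustrative values only: observed r = .19, criterion reliability = .50, u = .74.
print(round(operational_validity(0.19, 0.50, 0.74), 2))   # about 0.35
```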


TABLE 2
Summary of Artifact Distributions Used

                                                      Number of entries   Mean    SD
Predictor reliability (from McDaniel et al., 1994)          100            .84    .15
Criterion reliability
  Performance ratings                                        100            .50    .16
  Sales data                                                  39            .82    .17
  Production records                                           5            .99    .005
  Absenteeism                                                  7            .64    .16
  Tenure                                                      26           1.00    0
For combined analyses:
  Ratings and production data                                 39            .56    .16
  Ratings, production data, and sales data                    80            .69    .17
For analyses by job type within criterion type (sales data):
  Insurance sales                                              7            .81    .12
  Computer sales                                              17            .91    .09
  Financial services sales                                    12            .81    .17
  Other sales                                                 14            .68    .16
Range restriction* (from McDaniel et al., 1994)

*These values are ratios of the restricted predictor standard deviation (the SD among those hired) to the unrestricted predictor standard deviation (the SD in the applicant pool). These values were obtained by McDaniel et al. (1994) from empirical studies of interview validity.

Criterion reliability distributions. Separate criterion reliability distributions were developed for each criterion type, based on cumulative research findings in the literature. The mean of the distribution of (interrater) reliabilities for supervisory ratings of overall job performance was based on the finding by Rothstein (1990) and Viswesvaran, Ones, and Schmidt (1996) that average agreement between two knowledgeable raters is approximately .50. The variability around this mean was that used in Pearlman, Schmidt, and Hunter (1980); that is, SD = .16. Job performance ratings in this data set were all based on single raters, typically first-line supervisors.

The reliability figures for production records and sales data were based on the findings of the extensive review by Hunter, Schmidt, and Judiesch (1990). This study found that output and sales measures vary in test-retest reliability depending on the length of the time period over which they are measured. For example, Hunter et al. found that life insurance sales measured over a 13-week period had a test-retest reliability of .49; the stability of life insurance sales when measured over 1 year was .80.


Amount sold for other types of sales jobs had somewhat higher reliability: .68 for 13 weeks and .89 for 52 weeks. In our data, the time period of measurement was known for all but 2 of the 41 validities for the sales criterion. We used the Spearman-Brown formula to estimate the reliability for each of the 39 coefficients for which the time period was known; for the remaining two, we used the average time period to make these estimates. Reliability figures for life insurance sales from Hunter et al. were used only for life insurance sales criteria; reliability figures for other sales jobs were used for other sales data. Separate distributions of criterion reliabilities were created for computer sales, financial services, and other sales. For production records, Hunter et al. found a mean test-retest reliability of .55 for 1 week and .83 for 4 weeks. Again, the Spearman-Brown formula was used to estimate reliability for the five studies in our analysis that used production records. Because the time period of measurement was fairly long in these studies, these values are all close to 1.00 (meaning there was essentially no correction for criterion unreliability in these studies). Again, these artifact distributions are shown in Table 2.

Ones, Viswesvaran, and Schmidt (1993) found the average (test-retest) reliability for general absenteeism measures to be .17 for 1 month. This value was used with the Spearman-Brown formula and the specified time interval to calculate estimates of criterion reliability for the seven studies that used the criterion of absenteeism. The resulting values are shown in Table 2.

Finally, we could locate no studies reporting reliability estimates for tenure or turnover. Therefore no correction for criterion unreliability was made in the 21 studies that employed the criterion of job tenure. This forced assumption of perfect reliability for tenure and turnover measures results in underestimation of validity for predicting this criterion (Schmidt & Hunter, 1996; see Scenario 4), but this is unavoidable.

In the meta-analyses that included several criterion measures, the criterion reliability distribution used was created by combining the above criterion reliability distributions (e.g., for supervisory ratings, sales data, and production records) in a particular way: Only the mean values for each type of criterion and the corresponding frequency for that type of criterion measure were entered into the distribution. Hence the variability of values within each type of distribution was not reflected in the combined distribution. This procedure is conservative; it does not correct for variance in observed validities created by variability of reliability within each criterion type; hence, it overestimates standard deviations of true validities and underestimates the generalizability of validity.
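The Spearman-Brown adjustment used above to move criterion reliabilities from a base measurement period to a longer one can be sketched as follows; the numbers are the life insurance figures cited from Hunter, Schmidt, and Judiesch (1990).

```python
def spearman_brown(r_base, n):
    """Estimated reliability when the measurement period is lengthened by a
    factor of n, given the reliability r_base for the base period."""
    return (n * r_base) / (1 + (n - 1) * r_base)

# Life insurance sales: 13-week test-retest reliability of .49 extrapolated to
# 52 weeks (n = 4) gives about .79, close to the .80 reported for 1 year.
print(round(spearman_brown(0.49, 4), 2))   # 0.79
```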


TABLE 3
Meta-Analytic Results for Criterion Types

Criterion type                          N     K    r̄     SDr   SDres    ρ    SDρ   90% CV
1. Supervisory ratings                2539   33   .19   .15    .09    .40   .19    .16
2. Ratings & production records       2963   39   .21   .16    .10    .40   .18    .16
3. Ratings, production records,       9493   79   .17   .13    .08    .30   .14    .11
   and sales
4. Production records                  424    5   .29   .14    .07    .40   .10    .28
5. Sales performance                  6535   41   .15   .11    .07    .24   .11    .10
6. Absenteeism (reversed)              660    7   .10   .05    0      .19   0      .19
7. Job tenure                         1755   21   .28   .18    .13    .39   .18    .15

Note: N = number of observations; K = number of correlations; r̄ = mean observed correlation; SDr = standard deviation of observed correlations; SDres = residual standard deviation; ρ = estimated mean true validity; SDρ = standard deviation of true validity; 90% CV = lower 90% credibility value.

Results and Discussion

Table 3 shows the meta-analysis results by criterion type. A major purpose of this study is to examine the boundary conditions for earlier meta-analytic findings for the validity of interviews. In the McDaniel et al. (1994) meta-analysis of structured interview validity, most of the measures of overall job performance were based on supervisory ratings (87%). Therefore the first row of Table 3, which shows the validity of the empirically constructed interview for predicting supervisory ratings of overall job performance, is of particular interest.

Based on 89 studies, McDaniel et al. obtained an estimated mean true validity of .44 for structured interviews (see their Table 4, p. 606). Our comparable estimate for the mean true validity of the empirically constructed interview administered by telephone is .40. For this analysis, McDaniel et al. found an SDρ value of .28, while our value in row 1 of Table 3 is .19. Thus the mean validity estimates are quite similar, although the SDρ value is smaller in the present study, as predicted based on the greater similarity across interviews in our analysis. As a result of the smaller SDρ value, the 90% credibility value is larger in the present study: .16 in Table 3 versus .07 in McDaniel et al.
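For reference, the 90% credibility values being compared here are the standard one-tailed lower bounds used in validity generalization work, computed from the mean true validity and its standard deviation; for example, for the supervisory ratings row of Table 3:

```latex
\text{90\% CV} \;=\; \bar{\rho} - 1.28\,SD_{\rho} \;=\; .40 - 1.28 \times .19 \;\approx\; .16
```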


McDaniel et al. (1994) distinguished between supervisory job performance ratings obtained solely for research purposes and those available from personnel files (administrative ratings). The latter were obtained for routine performance appraisal purposes and might not have been as accurate as those obtained solely for research purposes. Separating on this factor, they obtained different estimates for the mean true validity of structured interviews: .37 (SDρ = .24) for administrative supervisory ratings (50 studies) and .51 (SDρ = .31) for research-only supervisory ratings (36 studies; see their Table 4, p. 606). They presented evidence to support their conclusion that the .51 obtained with the research-only ratings was probably the more accurate estimate of the real validity of the traditional structured interview. However, in the present set of studies, both types of ratings were used, and so the most relevant comparison with the McDaniel et al. findings is with the combined sample mean estimate of .44. However, in passing we note that in both of their separate analyses (by type of rating), the McDaniel et al. SDρ values (.24 and .31) are larger than any of the values observed in the present study. So the larger SDρ in McDaniel et al. is not caused by pooling across administrative and research-only supervisory ratings. (The same calculational procedures for meta-analysis were used in both studies.)

In comparing the mean true validity estimate for supervisory ratings of .40 in Table 3 with the corresponding McDaniel et al. (1994) value of .44, the reader should remember that all of the studies in the present meta-analyses are predictive studies. In the McDaniel et al. study set, 61% were predictive and 39% were concurrent. (For 10%, the study design could not be determined.) Although for cognitive measures predictive and concurrent validity estimates have been found to be equal (Society for Industrial and Organizational Psychology, 1987), this has not always been the case for noncognitive measures. For example, Ones et al. (1993) found that in the case of integrity tests concurrent validity estimates were somewhat larger than predictive estimates. Hence the small difference of .04 observed here could be due to the fact that about 40% of the studies included in the McDaniel et al. meta-analysis of structured interview validity were concurrent studies.

The second row in Table 3 shows results obtained when studies measuring job performance via production records are added to those using supervisory ratings. This analysis is relevant to the question of boundary conditions because some of the studies based on the job performance criterion in McDaniel et al. (1994) used production records as the criterion measure. In McDaniel et al., 13% of such studies used production records (McDaniel, personal communication, March 18, 1996), a figure very close to the 15% in row 2 of our Table 3. Findings are essentially identical to those for supervisory ratings alone: the estimated mean true validity remains the same at .40 and the 90% credibility value remains the same at .16. Again, findings for mean true validity are similar to those of McDaniel et al. (1994) for traditional types of structured interviews. The unusual features of the present interview appear to have no real effects on validity. Again, the SDρ value, at .18, is smaller than the McDaniel et al. SDρ value of .28, as predicted.


Next we added studies of sales jobs that measured job performance using the criterion of amount sold. This meta-analysis includes all studies using the criterion measures of supervisory ratings, production records, and sales records. Shown in line 3 of Table 3, these findings indicate a somewhat smaller mean true validity estimate (.30), but with an SDρ value small enough to still permit generalization of validity (90% credibility value of .11, still larger than the McDaniel et al. value of .07). The McDaniel et al. (1994) meta-analysis did not include measures of sales, so this mean validity is not directly comparable to any from that study.

The remaining rows in Table 3 show findings for individual criterion types. For the criterion of output as measured by production records alone, the mean true validity estimate is .40 (the same as for supervisory ratings) and the SDρ value is a relatively small .10. To our knowledge, this is the first meta-analysis specifically examining the validity of any type of interview for predicting output as measured by production records. Although based on only 5 studies and 425 workers and therefore susceptible to second-order sampling error, these findings suggest that validity findings do not differ between supervisory rating criteria and objective output measures, a potentially important finding for the interview if confirmed by future research. Ones, Viswesvaran, and Schmidt (1993) reported a similar finding for integrity tests. The validity of integrity tests for the criterion of output measured by production records was only slightly below that for supervisory ratings (.28 vs. .35; see their Table 5, p. 685).

Based on 41 studies and 6,535 employees, the interview appears to predict amount sold in sales jobs with an estimated mean true validity of .24 and a 90% credibility value of .10. This meta-analysis is also the only one of its type of which we are aware, so it is again not possible to compare these findings with those for the traditional structured interview.

The results for the criterion of absenteeism (reverse scored to indicate attendance and to eliminate negative validities) indicate a lower mean validity of .19. However, all variance is accounted for, leading to a 90% credibility value equal to the mean. Again, to our knowledge this is the first meta-analysis to examine the validity of an employment interview for this criterion measure. Although .19 is not high, utility analyses would likely find that a validity value of .19 is large enough to be of practical value in many organizations. The seven studies with the absenteeism criterion did not also include validities for job performance measures; however, in light of our other findings here, it should be pointed out that it is very likely that these interviews also predict job performance, creating a "bonus utility" that is probably greater than the utility of predicting absenteeism alone.


TABLE 4
Meta-Analyses of Job Type Within Criterion Type

Criterion type                   N     K    r̄     SDr   SDres    ρ    SDρ   90% CV
Supervisory ratings
  Sales jobs                   1183   12   .19   .11    .00    .38   .00    .38
  Teachers                      380   12   .18   .21    .09    .37   .19    .12
  Skilled and semiskilled       410    5   .31   .04    .14    .62   .27    .27
  Managers                      254    3   .13   .03    .13    .27   .27    .07
Sales data
  Life insurance sales          181    7   .32   .02    .00    .48   .04    .48
  Computer sales               5512   18   .13   .07    .03    .20   .04    .15
  Financial services sales      195    3   .16   .01    .00    .25   .00    .25
  Other sales                   647   13   .28   .05    .17    .43   .25    .10

Note: N = number of observations; K = number of correlations; r̄ = mean observed correlation; SDr = standard deviation of observed correlations; SDres = residual standard deviation; ρ = estimated mean true validity; SDρ = standard deviation of true validity; 90% CV = lower 90% credibility value.

The results for studies predicting job tenure are shown in line 7 of Table 3. The estimated mean true validity is .39, the SDρ estimate is .18, and the 90% credibility value is .15, indicating that validity is generalizable. As noted earlier, McDaniel et al. (1994) also examined the criterion of job tenure (see their Table 8 on p. 608). However, the validities contained in their meta-analysis of this criterion came from both unstructured and structured interviews. Of the 10 validity coefficients, 3 were based on structured interviews, 4 were based on unstructured interviews, and in the case of the remaining 3, too little information was reported to allow determination of degree of structure (McDaniel, personal communication, March 18, 1996). This would appear to be the explanation for their lower estimated mean true validity (.20 vs. our .39). It could also account (in part) for their larger SDρ value (.24 vs. our .18). In any event, unlike the McDaniel et al. analysis of the job performance criterion, the McDaniel et al. analysis of the job tenure criterion is not directly comparable to our analysis. What is needed is a meta-analysis of interview validity for job tenure based solely on traditional structured interviews.

The mean estimated true validity of .39 for predicting job tenure is among the highest in Table 3. This is a value large enough to have substantial practical value. Again, interviews that predict tenure on the job will also, based on our other findings here, predict performance on the job. Hence their utility can be expected to extend beyond the reduction of turnover.

As indicated in Table 1, the two most frequent criterion types were supervisory ratings of overall job performance (33 studies) and records of amount sold (41 studies). An important question is whether interview validity holds up for different job types within each of these two criterion types.


Table 4 shows findings separately for different job types appearing within each of these two criterion types. For supervisory ratings, estimated mean true validities for sales jobs (.38) and for teachers (.37) are very similar to the overall figure of .40 in the first row of Table 3. The figure for blue collar skilled and semiskilled manufacturing jobs (.62) is much larger than the overall figure, and the mean for managerial jobs (.27) is smaller than the overall mean of .40. Good measures of job performance may be more difficult to construct for managers than for skilled and semiskilled workers, and that might account to some extent for this difference. However, good measures of job performance are also difficult to obtain for teachers, who show a mean validity of .37. Another consideration is that these two means are each based on a small number of studies (5 for manufacturing jobs and 3 for managers) and are therefore less stable. It is probably best to interpret the .62 as an upward random bounce and the .27 as a downward random bounce until additional studies become available.

The breakout by job type for the sales data criterion appears to indicate higher mean validity for life insurance sales (.48) and "other sales" (.43) than for computer sales (.20) and financial services sales (.25). However, the number of studies is small for life insurance sales and financial services sales. The most important conclusion from the analyses in Table 4 is that useful levels of generalizable validity continue to appear when the data are broken out by job type, suggesting that this type of interview, properly constructed, administered, and scored for each job, is valid for any type of job.

General Discussion

The major focus of this study is on the boundary conditions for the validity evidence found in the research literature for employment interviews for predicting job performance. The interview examined in this meta-analysis differs in several respects from traditional structured interviews. The questions asked and the scored answers are selected empirically based on their ability to discriminate between outstanding and below average employees, rather than based on judgment informed by a traditional task-oriented job analysis. The interview is conducted by telephone without any face-to-face contact with the applicant. The person conducting the interview does not score it or evaluate the responses in any way, but rather merely tape records the applicant's answers for later scoring by specially trained scoring experts. The person conducting the interview does not elaborate on the meaning or intent of the questions, but can only repeat them if asked to do so. Yet based on 107 validity estimates, we find that the validity of this interview for predicting supervisory ratings of job performance is quite similar to that of other structured employment interviews as estimated in an earlier comprehensive meta-analysis.


In addition, we found that this empirically based interview predicts other criterion measures not examined separately in past meta-analyses of interview validity studies: amount sold in sales jobs, output as measured by production records, and absenteeism. Our findings extend the boundary conditions for interview validity and suggest that these boundaries may be quite wide. Future research may reveal that there are still additional ways to construct structured employment interviews that demonstrate substantial validity. Research findings of this sort obviously have important implications for practice.

Why do different approaches to construction of structured interviews have comparable validities for job performance? This study does not directly address this important question, but we would hypothesize that this finding obtains because different types of structured interviews all measure to varying degrees constructs with known generalizable validity: for example, the various facets of conscientiousness (e.g., dependability and achievement orientation) and general mental ability (cf. Ones, Viswesvaran, & Schmidt, 1993). In this connection, individual studies have sometimes found substantial correlations between structured interview scores and measures of mental ability (e.g., Campion, Pursell, & Brown, 1988). McDaniel et al. (1994) and Huffcutt, Roth, and McDaniel (1996) presented meta-analytic evidence that interview scores correlate positively with measured mental ability and therefore in part reflect this construct. However, much additional research will be necessary to fully explicate the constructs measured by employment interviews. Research on this question is now under way for the Gallup interview.

A key question concerning this interview for practitioners is whether, on balance, it is more advantageous to use than a traditional structured interview. According to Gallup, one advantage of this interview is that the information obtained can be used not only for selection but also for development. That is, employers can obtain from the analyst who scores the interview useful information on each person hired as to how to best supervise, train, and develop that person (Gallup, 1996, Sec. B, Selection Research Standards). It is not clear that such information is routinely available using the traditional structured interview.

Another potential advantage is that this type of interview requires a less skilled, less trained interviewer, because all the interviewer does is read the questions verbatim. Training and skill building are, of course, costly. However, this cost savings is offset by the higher costs of training and calibrating the analysts who score the interviews, so there is probably no net savings in this respect.


This type of interview separates the interviewer from the scorer of the interview. One effect of this might be to reduce or eliminate the effect of interviewer biases based on appearance, accent, background, and other such factors, thus increasing the objectivity of the interview evaluation. This fact could have implications for defensibility.

This interview has in practice proven to be quite cost efficient. Large numbers of applicants have been interviewed by telephone all over the U.S. (and even the world) without travel costs or facilities costs. In a typical application, a large retail or industrial firm with 200 to 400 locations in the U.S. interviews for local store or plant manager positions in many of these locations. However, as the reviewers pointed out, there is no reason in principle why the traditional structured interview could not also be administered by telephone. So administration by telephone need not be an exclusive advantage of this approach to interviewing. On balance, it is our conclusion that this type of interview has a sufficient number of positive features that it should receive consideration from practitioners in designing selection systems.

One reviewer suggested that any interview administered by telephone might encounter a cheating problem: the interviewee might have someone else take the interview for him or her. This problem would perhaps be more likely to occur in the telephone administration of a traditional structured interview than in the case of the interview described in this paper. The questions asked in this type of interview are much more personal (e.g., "How competitive are you?"); there are no obvious right or wrong answers, and it is hard to imagine an applicant believing that someone else could answer such questions for him or her better than the individual himself could. In addition, the imposter would have to have detailed knowledge of the applicant's past work experiences, emotional reactions to events within those experiences, preferences, likes, dislikes, and so forth. Given these considerations, it would appear unlikely that substitute interviewees would be used with any frequency. Although it could have occurred without their knowledge, Gallup practitioners state that they have not encountered this problem.

REFERENCES

American Society for Personnel Administration. (1983). ASPA-BNA Survey No. 45: Employment selection procedures. Washington, DC: Bureau of National Affairs.
Campion MA, Pursell ED, Brown BK. (1988). Structured interviewing: Raising the psychometric properties of the employment interview. PERSONNEL PSYCHOLOGY, 41, 25-42.
Gallup, Inc. (Undated). The teacher perceiver interview. Lincoln, NE: Author.
Gallup, Inc. (1996). The Gallup way. Lincoln, NE: Author.
Huffcutt AI, Arthur W Jr. (1994). Hunter and Hunter revisited: Interview validity for entry-level jobs. Journal of Applied Psychology, 79, 184-190.


Huffcutt AI, Roth PL, McDaniel MA. (1996). A meta-analytic investigation of cognitive ability in employment interview evaluations: Moderating characteristics and implications for incremental validity. Journal of Applied Psychology, 81, 459-473.
Hunter JE, Hunter RF. (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72-98.
Hunter JE, Schmidt FL. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Hunter JE, Schmidt FL. (1994). The estimation of sampling error variance in meta-analysis of correlations: The homogeneous case. Journal of Applied Psychology, 79, 171-177.
Hunter JE, Schmidt FL, Judiesch MK. (1990). Individual differences in output variability as a function of job complexity. Journal of Applied Psychology, 75, 28-42.
Law KS, Schmidt FL, Hunter JE. (1994). Nonlinearity of range corrections in meta-analysis: A test of an improved procedure. Journal of Applied Psychology, 79, 425-438.
McDaniel MA, Schmidt FL, Hunter JE. (1988). A meta-analysis of the validity of methods for rating training and experience in personnel selection. PERSONNEL PSYCHOLOGY, 41, 283-314.
McDaniel MA, Whetzel D, Schmidt FL, Maurer S. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599-616.
Ones DS, Viswesvaran C, Schmidt FL. (1993). Comprehensive meta-analysis of integrity test validities: Findings and implications for personnel selection and theories of job performance. Journal of Applied Psychology Monograph, 78, 677-703.
Pearlman K, Schmidt FL, Hunter JE. (1980). Validity generalization results for tests used to predict training success and job proficiency in clerical positions. Journal of Applied Psychology, 65, 373-406.
Rothstein HR. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322-327.
Schmidt FL, Caplan JR, Bemis S, Decuir R, Dunn L, Antone L. (1979). The behavioral consistency method of unassembled examining. Washington, DC: U.S. Office of Personnel Management, Personnel Research and Development Center, Research Branch.
Schmidt FL, Hunter JE. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
Schmidt FL, Law KS, Hunter JE, Rothstein HR, Pearlman K, McDaniel M. (1993). Refinements in validity generalization methods: Implications for the situational specificity hypothesis. Journal of Applied Psychology, 78, 3-12.
Schmidt FL, Rothstein HR. (1994). Application of validity generalization methods of meta-analysis to biographical data scores in employment selection. In Stokes GS, Mumford MD, Owens WA (Eds.), The biodata handbook: Theory, research, and applications (pp. 237-260). Consulting Psychologists Press.
Selection Research, Inc. (1992). SRI theme theory. Lincoln, NE: Author.
Society for Industrial and Organizational Psychology, Inc. (1987). Principles for the validation and use of personnel selection procedures (3rd ed.). College Park, MD: Author.
Ulrich L, Trumbo D. (1965). The selection interview since 1949. Psychological Bulletin, 63, 100-116.
Viswesvaran C, Ones DS, Schmidt FL. (1996). A comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557-560.