Psychological Methods 2003, Vol. 8, No. 2, 206–224

Copyright 2003 by the American Psychological Association, Inc. 1082-989X/03/$12.00 DOI: 10.1037/1082-989X.8.2.206

Beyond Alpha: An Empirical Examination of the Effects of Different Sources of Measurement Error on Reliability Estimates for Measures of Individual Differences Constructs

Frank L. Schmidt and Huy Le, University of Iowa

Remus Ilies, University of Florida

On the basis of an empirical study of measures of constructs from the cognitive domain, the personality domain, and the domain of affective traits, the authors of this study examine the implications of transient measurement error for the measurement of frequently studied individual differences variables. The authors clarify relevant reliability concepts as they relate to transient error and present a procedure for estimating the coefficient of equivalence and stability (L. J. Cronbach, 1947), the only classical reliability coefficient that assesses all 3 major sources of measurement error (random response, transient, and specific factor errors). The authors conclude that transient error exists in all 3 trait domains and is especially large in the domain of affective traits. Their findings indicate that the nearly universal use of the coefficient of equivalence (Cronbach’s alpha; L. J. Cronbach, 1951), which fails to assess transient error, leads to overestimates of reliability and undercorrections for biases due to measurement error.

Psychological measurement and measurement theory are fundamental tools that make progress in psychological research possible. Estimating relationships between constructs and testing hypotheses are essential to the advancement of psychological theories. However, research cannot estimate relationships between constructs directly. Constructs are measured, and it is the relationships between scores on the construct measures that are directly estimated. The process of measurement always contains error that biases estimates of relationships between constructs (there is no such thing as a perfectly reliable measure). Because of measurement error, the relationships between specific measures (observed relationships) underestimate the relationships between the constructs (true score relationships). To estimate true score relationships—the relationships of substantive interest—the observed relationships are often adjusted for the effect of measurement error; for these true score relationship estimates to be accurate, the adjustments need to be precise.

An important class of measurement errors that is often ignored when estimating the reliability of measures is transient error (Becker, 2000; Thorndike, 1951). If transient errors indeed exist, ignoring them in the estimation of the reliability of a specific measure leads to overestimates of reliability. Perhaps the most important consequence of overestimating reliability is the underestimation of the relationships between psychological constructs due to undercorrecting for the bias due to measurement error. This may lead to undesirable consequences, potentially hindering scientific progress (Schmidt, 1999; Schmidt & Hunter, 1999). Transient errors are defined as longitudinal variations in responses to measures that are produced by random variations in respondents' psychological states across time. Thus transient errors are randomly distributed across time (occasions). Consequently, the score variance produced by these random variations is not relevant to the construct that is measured and should be partialed out when estimating true score relationships. Measurement experts have long been aware of transient error.

Frank L. Schmidt and Huy Le, Department of Management and Organizations, Henry B. Tippie College of Business, University of Iowa; Remus Ilies, Department of Management, Warrington College of Business Administration, University of Florida. Correspondence concerning this article should be addressed to Frank L. Schmidt, Department of Management and Organizations, Henry B. Tippie College of Business, University of Iowa, Iowa City, Iowa 52242. E-mail: [email protected]



Thorndike (1951), for example, defined transient measurement error over 50 years ago and called for empirical research to calibrate its magnitude. However, to our knowledge, Becker (2000) provided the first empirical examination of the effects of transient error on the reliability of measures of psychological constructs. His study showed that the effect of transient error, while negligible in some measures (0.02% of total score variance for a measure of verbal aggression), can be substantial in others (10.01% of total score variance for a measure of hostility). Measurement error variance of such magnitude can substantially attenuate observed relationships between constructs and lead to erroneous conclusions in substantive research unless it is appropriately corrected for. Becker's findings indicate the desirability of more extensive examination of the effects of transient error in measures of psychological constructs.

The research presented here investigates the implications of transient error in the measurement of important individual differences variables. In an attempt to cover a significant part of the individual differences area, we examined measures of important constructs from (a) the cognitive domain (cognitive ability), (b) the broad personality domain (the Big Five factors of personality, generalized self-efficacy, and self-esteem), and (c) the domain of affective traits (positive affectivity [PA] and negative affectivity [NA]). Becker's (2000) study focused on the four subscales of the Buss–Perry Aggression Questionnaire (Buss & Perry, 1992). In this study, we examine 10 constructs more central to research in differential psychology.

To clarify the impact of transient error processes, we first review classical measurement methods used to correct for biases introduced by measurement error when estimating relationships between constructs and develop a method to improve the accuracy of this correction process. We then apply this method to study the impact of transient errors on the measurement of individual differences constructs from the categories noted above, which is the substantive purpose of this study.

In classical measurement theory, the fundamental formula for the observed correlation in the population between two measures (x and y) is

$$\rho_{xy} = \rho_{x_t y_t}\,(\rho_{xx}\,\rho_{yy})^{1/2}, \tag{1}$$

where $\rho_{xy}$ is the observed correlation, $\rho_{x_t y_t}$ is the correlation between the true scores of (the constructs underlying) measures x and y, and $\rho_{xx}$ and $\rho_{yy}$ are the reliabilities of x and y, respectively. This is called the attenuation formula because it shows how measurement error in the x and y measures reduces the observed correlation ($\rho_{xy}$) below the true score correlation ($\rho_{x_t y_t}$). Solving this equation for $\rho_{x_t y_t}$ yields the disattenuation formula:

$$\rho_{x_t y_t} = \rho_{xy} / (\rho_{xx}\,\rho_{yy})^{1/2}. \tag{2}$$

With an infinite sample size (i.e., in the population), this formula is perfectly accurate (so long as the appropriate type of reliability estimate is used). In the smaller samples used in research, there are sampling errors in the estimates of $\rho_{xy}$, $\rho_{xx}$, and $\rho_{yy}$, and therefore there is also sampling error in the estimate of $\rho_{x_t y_t}$. Because of this, a circumflex is used in Equation 3 below to indicate that $\rho_{x_t y_t}$ is estimated. For the same reason, $r_{xy}$, $r_{xx}$, and $r_{yy}$ are used instead of $\rho_{xy}$, $\rho_{xx}$, and $\rho_{yy}$, respectively. The disattenuation formula therefore can be rewritten as follows:

$$\hat{\rho}_{x_t y_t} = r_{xy} / (r_{xx}\,r_{yy})^{1/2}. \tag{3}$$
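As a concrete numerical sketch of applying Equation 3 (the values below are hypothetical, chosen only for illustration):

```python
def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Equation 3: estimate the true score correlation from the observed
    correlation and the reliability estimates of the two measures."""
    return r_xy / (r_xx * r_yy) ** 0.5

r_xy = 0.30  # observed correlation between measures x and y (hypothetical)
r_xx = 0.80  # reliability estimate for x (hypothetical)
r_yy = 0.70  # reliability estimate for y (hypothetical)

print(round(disattenuate(r_xy, r_xx, r_yy), 3))  # 0.401
```

Note that if $r_{xx}$ or $r_{yy}$ is overestimated, the denominator is too large and the corrected correlation is too small; this is the undercorrection problem that motivates the present article.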

Thus, $\hat{\rho}_{x_t y_t}$ is the estimated correlation between the construct underlying measure x and the construct underlying measure y; this estimate approximates the population value ($\rho_{x_t y_t}$) when the sample size is reasonably large. As Equation 3 shows, for the estimate of the true score correlation between the constructs measured by x and y ($\hat{\rho}_{x_t y_t}$ in Equation 3) to be unbiased, the estimates of the reliabilities of measures x and y need to be accurate.

To accurately estimate the reliabilities of construct measures, one must understand and account for the processes that produce measurement error. Three major error processes are relevant to all psychological measurement: random response error, transient error, and specific factor error. (A fourth error process, disagreement between raters, is relevant only to measurement that uses multiple raters of the constructs that are measured.) Depending on which sources of error variance are taken into account when computing reliability estimates, the magnitude of the reliability estimates can vary substantially, which can lead to erroneous interpretations of observed relationships between psychological constructs (Schmidt, 1999; Schmidt & Hunter, 1999; Thorndike, 1951).

Schmidt and Hunter (1999) observed that measurement error is typically not defined substantively in psychological research. These authors asserted that unless the psychological processes that produce measurement error are described, “one is left with the impression that measurement error springs from hidden and unknown sources and its nature is mysterious” (Schmidt & Hunter, 1999, p. 192). Thus, following Schmidt and Hunter, we first describe the substantive nature of the three main types of error variance sources. Next we examine the bias (in estimating the reliability of measures) associated with selective consideration of these sources. Finally, we propose a method of computing reliability estimates that takes into consideration all main sources of measurement error, and we present an empirical study of the effect of the different types of error variance sources on measures of important individual differences constructs.

It has long been known that many sources of variance influence the observed variance of measures in addition to the constructs they are meant to measure (Cronbach, 1947; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Thorndike, 1951). Variance due to these sources is measurement error variance, and its biasing effects must be removed in the process of estimating the true relationships between constructs. Although the multifaceted nature of measurement error has occasionally been examined in classical measurement theory (e.g., Cronbach, 1947), it is the central focus of generalizability theory (Cronbach et al., 1972). Although there is potentially a large number of measurement error sources (termed facets in generalizability theory; Cronbach et al., 1972), only a few sources contribute significantly to the observed variance of self-report measures (Le & Schmidt, 2001). These sources are random response error, transient error, and specific factor error.

Random Response Error

Random response error is caused by momentary variations in attention, mental efficiency, distractions, and so forth within a given occasion. It is specific to the moment when subjects respond to an item of a measure. As such, a subject may, for example, provide different answers to the same item if it appears in two places in a questionnaire. These variations can be thought of as noise in the human central nervous system (Schmidt & Hunter, 1999; Thorndike, 1951). Random response error is reduced by summing item scores. Thus, other things equal, the larger the number of items in a specific scale, the greater the extent to which random response error is reduced.

Transient Error

Whereas random response error occurs across items (i.e., discrete time moments) within the same occasion, transient error occurs across occasions. Transient errors are produced by longitudinal variations in respondents' mood, feelings, or the efficiency of the information-processing mechanisms used to answer questionnaires (Becker, 2000; Cronbach, 1947; DeShon, 1998; Schmidt & Hunter, 1999). These occasion-characteristic variations influence the responses to all items that are answered on that specific occasion. For example, respondents' affective state (mood) will influence their responses to various satisfaction questionnaires, in that respondents in a more pleasurable affective state on an occasion will be more likely to rate their satisfaction with various aspects of their lives or jobs higher. If those responses are used to examine cross-sectional relationships between satisfaction constructs and other constructs, the variations in the satisfaction scores across occasions should be treated as measurement error (i.e., transient error). Under the conceptualization of generalizability theory (Cronbach et al., 1972), transient error is the interaction between persons (measurement objects) and occasions.

Transient error can be defined only when the construct to be measured is assumed to be substantively stable across the time period over which transient error is assessed. Individual differences constructs such as intelligence or various personality traits can safely be assumed to be substantively stable across at least short periods of time, their score variation across occasions being caused by transient error. For constructs that exhibit systematic variations even across short time periods, such as mood, changes in the magnitude of the measurement cannot be considered the result of transient errors, because those changes have substantive (theoretically significant) causes that can and should be studied and understood (Watson, 2000).

If transient error indeed has a nonnegligible impact on the measurement of individual differences (i.e., it reduces the reliability of such measures), this impact needs to be estimated, and relationships among constructs need to be corrected for the downwardly biasing effect of transient error on these relationships. That is, observed relationships should be corrected for unreliability using reliability estimates that take transient error into account. To assess the impact of transient error on the measurement of psychological constructs, one estimates the amount of transient error in the scores of a measure by correlating responses across occasions. As noted, transient error is all too often ignored when estimating the reliability of measures (Becker, 2000).


Specific Factor Error

This type of measurement error arises from subjects' idiosyncratic responses to some aspect of the measurement situation (Schmidt & Hunter, 1999). At the item level, specific factor errors are produced by respondent-specific interpretation of the wording of questionnaire items. In that sense, specific factor errors correspond to the interaction between respondents and items. Specific factors are not part of the constructs being measured because they do not correlate with scores on other items measuring that construct (i.e., they are item specific); thus they are measurement errors. Specific factor errors tend to cancel each other out across different items, but for a specific item they replicate across occasions. At the scale level, different scales created to measure the same construct may also contain specific factor errors. Scale-specific factor error can be controlled only by using multiple scales to measure the same construct, thus defining the construct being measured by what is measured and shared by all scales (Le & Schmidt, 2001; Schmidt & Hunter, 1999). That is, the construct is defined as the factor that the scales have in common.

Calibrating Measurement Error Processes With Reliability Coefficients

By definition, reliability is the ratio of true score variance to observed score variance. The basic conceptual difference between methods of computing reliability lies in the ways in which they define and estimate true score variance. Estimates of true score variance based on different methods may include variance components due to measurement errors in addition to those due to the construct of interest. To put it another way, different methods implicitly assess the extent to which different error processes (i.e., random response, transient, and specific factor) produce variation in test scores that is not relevant to the underlying trait to be measured (i.e., measurement error). The magnitude of an error process is assessed by a reliability estimate only when that error process is allowed to reduce the size of the reliability estimate; the magnitude of the reduction is a measure of the impact of the error process that produced it. It follows that some methods of computing reliability are more appropriate than others because they better model the various error processes. These issues have in fact long been known and discussed by measurement researchers (e.g., Cronbach, 1947; Thorndike, 1951). Nevertheless, we reexamine the topic here to clearly define the problem and then offer a solution. Specifically, we discuss below how popular estimates of reliability assess different sources of measurement error.
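In classical notation (our restatement of the definition above, not a formula quoted from these sources), with observed score $X = T + E$ and $\mathrm{Cov}(T, E) = 0$,

$$r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2},$$

so the methods reviewed below differ in which error components they implicitly count in $\sigma_E^2$.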

Internal Consistency Reliability

This method of computing reliability is probably the one most frequently used in psychological research, and computations of internal consistency are included in all standard statistical analysis programs. Internal consistency reliability assesses specific factor error and random response error. Specific factor errors and random response errors are independently distributed across items and tend to cancel each other out when item scores are summated. The most popular index of the internal consistency of a measure is coefficient alpha (Cronbach, 1951). Cronbach (1951) showed that coefficient alpha is the average of all the split-half reliabilities. A split-half reliability is the correlation between two halves of a scale, adjusted to reflect the reliability of the full-length scale. As such, coefficient alpha is conceptually equivalent to the correlation between two parallel forms of a scale administered on the same occasion (Cronbach, 1951). Parallel forms of a scale are different sets of items having identical loadings on the construct to be measured but different specific factors (further definition and discussion of parallel forms is provided in a subsequent section). When parallel forms of a scale administered on the same occasion are correlated, measurement error variance due to specific factor error and random response error reduces the correlation between the parallel forms.

Thus, in practice, internal consistency reliability can be estimated in three different ways: (a) by the correlation between parallel forms of a scale; (b) by splitting a scale into halves, computing the correlation between the resulting subscales, and correcting this estimate for the reduced number of items with the Spearman–Brown prophecy formula; and (c) by using formulas specific to the measurement situation, such as those for Cronbach's alpha (when items are scored on a continuum; Cronbach, 1951) or the Kuder–Richardson 20 (when items are scored dichotomously). Because the most direct method for computing internal consistency is to correlate parallel forms of the scale administered on one occasion, this class of reliability estimates is referred to as the coefficient of equivalence (CE; Anastasi, 1988; Cronbach, 1947). Feldt and Brennan (1989) provided a formula illustrating the variance components included as true score variance by the CE (Equation 102, p. 135). The CE assesses the magnitude of measurement error produced by specific factor and random response error processes but not by transient error processes.

Test–Retest Reliability

Test–retest reliability, as its name implies, is computed by correlating scores on the same form of a measure across two different occasions. The test–retest reliability coefficient is alternatively referred to as the coefficient of stability (CS; Cronbach, 1947) because it estimates the extent to which scores on a specific measure are stable across time (occasions). The CS assesses the magnitude of transient error and random response error, but it does not assess specific factor error, as shown by Feldt and Brennan (1989; Equation 100, p. 135).

Coefficient of Equivalence and Stability

The coefficient of equivalence assesses specific factor error and random response error but not transient error, whereas the CS assesses transient and random response error but not specific factor error. The only type of reliability coefficient that estimates the magnitude of the measurement error produced by all three error processes is the coefficient of equivalence and stability (CES; Cronbach, 1947). The CES is computed by correlating two parallel forms of a measure, each administered on a different occasion. The use of parallel forms administered on one occasion assesses only specific factor error and random response error, and the administration of the same measure across two different occasions assesses only transient error and random response error. Hence, the CES is the ideal reliability estimate because its magnitude is appropriately reduced by all three sources of measurement error. Supporting this argument, Feldt and Brennan (1989) provided a formula for the CES showing that its estimated true score variance includes only variance due to the construct of interest (Equation 101, p. 135). The other types of reliability coefficients selectively take into account only some sources of measurement error, and thus they overestimate the reliability of the scale.
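The pattern can be summarized in a simplified variance-components gloss (our notation, consistent with the Feldt and Brennan, 1989, formulas cited above). Decomposing observed score variance into construct, specific factor, transient, and random response components,

$$\sigma_X^2 = \sigma_C^2 + \sigma_{spec}^2 + \sigma_{trans}^2 + \sigma_{rand}^2,$$

the three coefficients treat different components as true score variance:

$$\mathrm{CE} = \frac{\sigma_C^2 + \sigma_{trans}^2}{\sigma_X^2}, \qquad \mathrm{CS} = \frac{\sigma_C^2 + \sigma_{spec}^2}{\sigma_X^2}, \qquad \mathrm{CES} = \frac{\sigma_C^2}{\sigma_X^2}.$$

Under this gloss, the difference CE − CES isolates the proportion of observed variance due to transient error, a point used later in the article.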

Computing the Coefficient of Equivalence and Stability

As already noted, to compute the CES of a measure for a specific sample, two parallel forms of the measure, each administered on a different occasion, are required. The correlation of the scores on these parallel forms is reduced by all three main sources of measurement error, and it equals the CES. However, in many cases parallel forms of the same measure may not be available. For such cases, we present here an alternative method for estimating the CES of a scale from the CES of its half scales, which are obtained either by (a) administering the scale on two distinct occasions and then splitting it post hoc to form parallel half scales or (b) splitting the scale into parallel halves and then administering each half on a separate occasion. The former approach (i.e., post hoc division) provides multiple estimates of the CE and CES coefficients of the half scales, which we can average to reduce the effect of sampling error, thereby obtaining more accurate estimates of the coefficients of interest. However, administering the same scale on two occasions with a short interval can result in measurement reactivity (i.e., subjects' memory of item content could affect their responses on the second occasion; Stanley, 1971; Thorndike, 1951). This problem is especially likely for measures of cognitive constructs. In such a situation, it may be advisable to adopt the latter approach: The scale should be split into halves first, with each half administered on only one occasion.

After estimating the half-scale CES, one needs to compute the CES for the full scale. Using the Spearman–Brown prophecy formula to adjust for the increased (doubled) number of items in the full scale appears to be the most direct approach. However, Feldt and Brennan (1989) cautioned that the Spearman–Brown formula does not apply when there are multiple sources of measurement error involved (Feldt & Brennan, 1989, p. 134), which is the case here. We demonstrate below that use of the Spearman–Brown formula is not accurate in this case.

Our derivations assume that all the statistics are obtained from a very large sample, so there is no sampling error; in other words, they are population parameters. We further assume that there exist parallel forms for the scale in question and that these forms can be further split into parallel subscales. Parallel forms are defined under classical measurement theory as scales that (a) have the same true score and (b) have the same measurement error variance (Traub, 1994). It can be readily inferred from this definition that parallel forms have the same standard deviation and reliability coefficient (ratio of true score variance to observed score variance). Parallel forms thus defined are strictly parallel, as contrasted with randomly parallel forms, which are scales having the same number of items randomly sampled from a hypothetical infinite item domain of the construct in question (Nunnally & Bernstein, 1994; Stanley, 1971). Randomly parallel forms may have different true score and observed score variances due to sampling error resulting from the sampling of items.

Let A and B be two parallel forms of a scale that were both administered at two times: Time 1 and Time 2. 1A symbolizes Form A administered at Time 1; 2A symbolizes Form A administered at Time 2; 1B symbolizes Form B administered at Time 1; and 2B symbolizes Form B administered at Time 2. With this notation, the CES can be written as

$$\mathrm{CES} = \rho(1A, 2B) = \rho(2A, 1B). \tag{4}$$

If we split A to create two parallel half scales, a1 and a2, and split B to create two parallel half scales, b1 and b2, we then have four parallel half scales, a1, a2, b1, and b2, with

$$A = a_1 + a_2, \tag{5}$$

$$B = b_1 + b_2. \tag{6}$$

From Equations 4, 5, and 6, with prefixes 1 and 2 denoting the time when a half scale is administered, we have

$$\mathrm{CES} = \rho(1a_1 + 1a_2,\; 2b_1 + 2b_2) = \rho(2a_1 + 2a_2,\; 1b_1 + 1b_2). \tag{7}$$

Because a1, a2, b1, and b2 are parallel scales, their standard deviations are the same. Further, this standard deviation should be invariant across Times 1 and 2. Symbolizing this value $\sigma$, we have

$$\sigma(1a_1) = \sigma(1a_2) = \sigma(1b_1) = \sigma(1b_2) = \sigma(2a_1) = \sigma(2a_2) = \sigma(2b_1) = \sigma(2b_2) = \sigma. \tag{8}$$

Further, let the CE of the half scales be symbolized as ce; this coefficient can be written as

$$ce = \rho(1a_1, 1a_2) = \rho(2a_1, 2a_2) = \rho(1b_1, 1b_2) = \rho(2b_1, 2b_2). \tag{9}$$

Let the CES of the half scales be symbolized as ces; by definition this coefficient is the correlation between any two different half scales across times:

$$\begin{aligned} ces &= \rho(1a_1, 2a_2) = \rho(1a_1, 2b_1) = \rho(1a_1, 2b_2) = \rho(1a_2, 2a_1) = \rho(1a_2, 2b_1) = \rho(1a_2, 2b_2) \\ &= \rho(1b_1, 2a_1) = \rho(1b_1, 2a_2) = \rho(1b_1, 2b_2) = \rho(1b_2, 2a_1) = \rho(1b_2, 2a_2) = \rho(1b_2, 2b_1). \end{aligned} \tag{10}$$

From Equation 7 above and the definition of the correlation coefficient, we have

$$\mathrm{CES} = \rho(1a_1 + 1a_2,\; 2b_1 + 2b_2) = \frac{\mathrm{Cov}[1a_1 + 1a_2,\; 2b_1 + 2b_2]}{[\mathrm{Var}(1a_1 + 1a_2)]^{1/2}\,[\mathrm{Var}(2b_1 + 2b_2)]^{1/2}}. \tag{11}$$

Using Equations 8 and 10 above, we can rewrite the numerator of the right side of Equation 11 as follows:

$$\begin{aligned} &\mathrm{Cov}(1a_1, 2b_1) + \mathrm{Cov}(1a_1, 2b_2) + \mathrm{Cov}(1a_2, 2b_1) + \mathrm{Cov}(1a_2, 2b_2) \\ &\quad = \rho(1a_1, 2b_1)\sigma(1a_1)\sigma(2b_1) + \rho(1a_1, 2b_2)\sigma(1a_1)\sigma(2b_2) \\ &\qquad + \rho(1a_2, 2b_1)\sigma(1a_2)\sigma(2b_1) + \rho(1a_2, 2b_2)\sigma(1a_2)\sigma(2b_2) = 4\sigma^2\,ces. \end{aligned} \tag{12}$$

Using Equations 8 and 9 above, we can rewrite the denominator of the right side of Equation 11 as

$$[\sigma^2(1a_1) + \sigma^2(1a_2) + 2\rho(1a_1, 1a_2)\sigma(1a_1)\sigma(1a_2)]^{1/2}\,[\sigma^2(2b_1) + \sigma^2(2b_2) + 2\rho(2b_1, 2b_2)\sigma(2b_1)\sigma(2b_2)]^{1/2} = 2\sigma^2(1 + ce). \tag{13}$$

Finally, from Equations 12 and 13, we have

$$\mathrm{CES} = \frac{2\,ces}{1 + ce}. \tag{14}$$

Equation 14 gives a simple formula for computing the CES of a full scale that was split into halves from the CES of the half scales. It can be inferred from the above derivations that this equation yields the exact value of the CES when (a) all statistics are obtained from a very large sample and (b) strictly parallel half scales can be formed by appropriately splitting the full scale. To the extent that these assumptions do not hold in practice (i.e., there is a limited sample size and there are departures from strict parallelism of the half scales), the estimate of the CES is affected by sampling error in both subjects and items (Feldt, 1965; Lord, 1955). Nevertheless, the formula presented here provides a convenient way to obtain an unbiased and asymptotically accurate estimate of the CES for any measure with real samples (i.e., limited sample sizes) and in the absence of perfectly parallel forms. When half scales are formed by post hoc splitting of a full scale as described above, sampling error can be reduced by averaging statistics of subscales to obtain more accurate estimates of population values. If the scales are split into random halves (instead of strictly parallel halves), the resulting reliability coefficient estimates the random CES (Nunnally & Bernstein, 1994).
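As a minimal numerical sketch of Equation 14 (the half-scale correlations below are hypothetical):

```python
import numpy as np

def full_scale_ces(ce_half: float, ces_half: float) -> float:
    """Equation 14: CES of the full scale from the half-scale ce and ces."""
    return 2.0 * ces_half / (1.0 + ce_half)

# Hypothetical correlations among parallel half scales, averaged as in
# Equations 9 and 10 to reduce sampling error.
ce = np.mean([0.72, 0.70, 0.74, 0.71])   # same-occasion half-scale correlations
ces = np.mean([0.64, 0.66, 0.63, 0.65])  # cross-occasion half-scale correlations

print(round(full_scale_ces(ce, ces), 3))  # 0.751
print(round(2 * ces / (1 + ces), 3))      # 0.784: the naive Spearman-Brown step-up of ces, discussed next
```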


We note here that the use of the Spearman–Brown prophecy formula is not appropriate to adjust the CES for the length of the full scale (though it would be appropriate to adjust the CE). If the Spearman–Brown formula were erroneously applied here to estimate the CES from ces, the erroneous estimate of the adjusted CES would be

$$\mathrm{CES} = \frac{2\,ces}{1 + ces}. \tag{15}$$

Comparing Equations 14 and 15 reveals that adjusting the CES with the Spearman–Brown formula results in overestimation of the full-scale CES (because ces is always smaller than ce).

Extending the Formula for Estimating the CES

Equation 14 provides a formula for computing the CES of a full scale from the CE and CES of its half scales. To derive the equation, we assumed that (a) a scale can be split into parallel half scales and (b) the psychometric properties of the half scales (standard deviations and ce) remain unchanged across occasions. Arguably, these assumptions are not strong and are likely to be met in practice. Nevertheless, as an anonymous reviewer pointed out, the assumptions can be further relaxed. When measurement reactivity is not a problem and thus the same scale can be administered at both times (i.e., when the post hoc splitting approach mentioned earlier is adopted), the following equation can be used to estimate the CES of the scale:¹

$$\mathrm{CES} = \frac{2\,[\mathrm{Cov}(1a_1, 2a_2) + \mathrm{Cov}(2a_1, 1a_2)]}{[\mathrm{Var}(1A)]^{1/2}\,[\mathrm{Var}(2A)]^{1/2}}, \tag{16}$$

where Cov(1a1, 2a2) is the covariance between half scale a1 administered at Time 1 (1a1) and half scale a2 administered at Time 2 (2a2); Cov(2a1, 1a2) is the covariance between half scale a1 administered at Time 2 (2a1) and half scale a2 administered at Time 1 (1a2); Var(1A) is the variance of the full Scale A administered at Time 1; and Var(2A) is the variance of the full Scale A administered at Time 2.

Equation 16 assumes only that Scale A can be divided into two parallel forms, a1 and a2. In practice, however, there are scales with an odd number of items; hence it is not possible to divide them into parallel half scales. The assumption underlying Equation 16 can then be further relaxed to cover this situation. A scale can now be simply divided into two subscales, and then the following equation can be used to estimate the CES of the full scale from information on the subscales:

$$\mathrm{CES} = \frac{\mathrm{Cov}(1a_1, 2a_2) + \mathrm{Cov}(2a_1, 1a_2)}{2\,p_1 p_2\,[\mathrm{Var}(1A)]^{1/2}\,[\mathrm{Var}(2A)]^{1/2}}, \tag{17a}$$

where Cov(1a1, 2a2) is the covariance between subscale a1 administered at Time 1 (1a1) and subscale a2 administered at Time 2 (2a2); Cov(2a1, 1a2) is the covariance between subscale a1 administered at Time 2 (2a1) and subscale a2 administered at Time 1 (1a2); Var(1A) is the variance of the full Scale A administered at Time 1; Var(2A) is the variance of the full Scale A administered at Time 2; p1 is the ratio of the number of items in subscale a1 to the number in the full scale; and p2 is the ratio of the number of items in subscale a2 to the number in the full scale.

Equations 16 and 17a can be applied only when the same scale is used at both times (the post hoc splitting approach). In situations where it is impossible to apply this approach because of time constraints or concern about measurement reactivity, a modification can be introduced into Equation 17a to enable estimating the CES for scales with an odd number of items:

$$\mathrm{CES} = \frac{\mathrm{Cov}(1a_1, 2a_2)}{p_1 p_2\,[\widehat{\mathrm{Var}}(1A)]^{1/2}\,[\widehat{\mathrm{Var}}(2A)]^{1/2}}, \tag{17b}$$

where Cov(1a1, 2a2) is the covariance between subscale a1 administered at Time 1 (1a1) and subscale a2 administered at Time 2 (2a2); p1 and p2 are the proportions of total items in subscales a1 and a2; and V̂ar(1A) and V̂ar(2A) are the estimated variances of the full Scale A. As seen above, we need to estimate the variances of the full Scale A when it is administered at Times 1 and 2 (these are not directly available, because only one subscale is administered at each time in this situation). These values can be obtained from the following equations (Appendix A provides derivations):



$$\mathrm{Var}(1A) = \frac{\mathrm{Var}(1a_1)\,[p_1 - (p_1 - 1)\,ce_1]}{p_1^2}, \tag{18a}$$

$$\mathrm{Var}(2A) = \frac{\mathrm{Var}(2a_2)\,[p_2 - (p_2 - 1)\,ce_2]}{p_2^2}, \tag{18b}$$

where $ce_1$ and $ce_2$ are the coefficients of equivalence of subscale a1 and subscale a2, respectively. From Equation 16 to Equation 17b above, we have gradually relaxed the assumptions required for estimating the CES. This is achieved at the cost of simplicity: Those equations are more complicated than Equation 14. Using Equations 16 and 17a also requires an additional assumption, namely, that measurement reactivity is not a problem. Equation 17b further relaxes this assumption by allowing the CES to be estimated from subscales administered at two different times. All these equations, especially Equation 17b, can be used in most practical situations where the assumptions underlying Equation 14 are not likely to hold.²
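As a minimal numerical sketch (with hypothetical subscale statistics) of applying Equations 17b, 18a, and 18b to a 9-item scale split into 5- and 4-item subscales:

```python
def full_scale_var(var_sub: float, p: float, ce_sub: float) -> float:
    """Equations 18a/18b: estimated full-scale variance from one subscale."""
    return var_sub * (p - (p - 1.0) * ce_sub) / p**2

def ces_eq17b(cov_12: float, p1: float, p2: float,
              var_1A: float, var_2A: float) -> float:
    """Equation 17b: CES when each subscale is administered on only one occasion."""
    return cov_12 / (p1 * p2 * (var_1A * var_2A) ** 0.5)

# Hypothetical statistics for a 9-item scale split 5/4:
p1, p2 = 5 / 9, 4 / 9        # proportions of items in subscales a1 and a2
var_1a1, ce1 = 18.0, 0.74    # Time 1: observed variance and alpha of subscale a1
var_2a2, ce2 = 12.5, 0.70    # Time 2: observed variance and alpha of subscale a2
cov_12 = 10.2                # covariance of 1a1 with 2a2

var_1A = full_scale_var(var_1a1, p1, ce1)  # estimated Var(1A), Equation 18a
var_2A = full_scale_var(var_2a2, p2, ce2)  # estimated Var(2A), Equation 18b
print(round(ces_eq17b(cov_12, p1, p2, var_1A, var_2A), 3))  # 0.792
```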

The Impact of Transient Error on the Measurement of Individual Differences Constructs

As shown above, the only type of reliability that takes into consideration all the main sources of measurement error is the CES. The most widely used reliability coefficient, Cronbach's alpha, does not take into account transient error. Thus, Cronbach's alpha (i.e., the CE) is a potential overestimate of the reliability of a measure (if transient error processes indeed affect the measured scores). By computing the CES with the method detailed in the preceding section, we can estimate the extent to which the CE overestimates reliability: Subtracting the CES from the CE yields the proportion of observed variance produced by transient error processes (Feldt & Brennan, 1989; Le & Schmidt, 2001). This approach was previously adopted by Becker (2000) to estimate the effect of transient error.

For constructs that exhibit trait-like characteristics, the processes that produce temporal variations in the measurements of a specific construct should be treated as measurement error processes, but for state-like constructs, temporal variations of the measurements cannot be treated as measurement error.


Extending the trait–state dichotomy along a continuum, it follows that, except for constructs that can be placed at the ends of this continuum (such as intelligence, which can be placed at the trait end, or emotions, which are at the state end), the trait versus state distinction depends—at least for the purpose of defining transient error—on the specific (theoretical) definition of the construct that is measured. For example, the Positive and Negative Affect Schedule (PANAS; Watson, Clark, & Tellegen, 1988) scales can be used to assess both temporary mood-state (affect) and stable-trait (affectivity) constructs,³ the difference lying only in the instructions given to the respondents (momentary instructions vs. general or long-term instructions). Furthermore, empirical evidence shows that average levels of daily state affect are substantially correlated with affectivity trait measures (r = .64 and r = .53 for Positive Affect [PA] and Negative Affect [NA], respectively; Watson & Clark, 1994); thus scores of momentary affect ratings (i.e., mood states) can be considered indicators of trait-level affectivity if such measures are averaged across multiple occasions. The larger the number of occasions that are averaged over, the more closely the average approaches a trait measure.

A particularly illustrative example of how the same measures can be used to construct scores whose position on the trait–state continuum varies from state to trait was presented by Watson (2000). He presented test–retest correlations for scores that were constructed by aggregating daily mood measures across a number of days increasing from 1 to 14. The test–retest correlation for the composite scores increased monotonically with the number of days entering the composite, from .44 to .85 for Positive Affect and from .37 to .77 for Negative Affect (Watson, 2000, p. 147, Table 5.1). That is, the more closely the measure approximated a trait measure, the more stable it became.

Transient measurement error processes were expected to affect the measurement of all constructs included in this study, but the impact on the measurement of the different traits may not be equal.

¹ We are grateful to an anonymous reviewer for suggesting the way to relax the assumptions underlying Equation 14. The reviewer also provided Equations 16 and 17 and their derivations. Derivation details are available from Frank L. Schmidt upon request.

² From our available data (not from this study), all these equations yield very similar results, even when the assumptions for Equation 14 are violated (i.e., when the standard deviations of a scale differ across testing occasions). In other words, Equation 14 appears to be quite robust to violation of its assumptions.

³ In the literature, affective states (moods) and affective traits (personality) are referred to as affect and affectivity, respectively. We adopt this terminology in this article.


Conley (1984) estimated the “annual stabilities” (i.e., average 1-year test–retest correlations between measures, corrected for measurement error using coefficient alpha) of measures of intelligence, personality, and self-opinion (i.e., self-esteem) to be .99, .98, and .94, respectively. Those findings may actually reflect differences in (a) the stabilities of the constructs (as argued by Conley), (b) the susceptibilities of their measures to transient error, or (c) both. Schmidt and Hunter (1996, 1999), on the basis of initial empirical evidence, hypothesized that transient error is lower in the cognitive domain than in the noncognitive domain. In accordance with Schmidt and Hunter's (1996, 1999) hypothesis, we expected transient error to be smallest in the cognitive domain.

Within the personality domain, we expected transient error to influence affectivity traits to a larger extent than broad personality traits. This expectation is suggested by the relatively lower test–retest correlations often observed in measures of affectivity traits when compared with those of other dispositional measures (Judge & Bretz, 1993). Even when compared with the Big Five traits to which they are frequently related (i.e., Extraversion and Neuroticism, respectively; e.g., Brief, 1998; Watson, 2000), positive affectivity and negative affectivity display relatively smaller stability coefficients over moderate time periods (2.5–3 years; Vaidya, Gray, Haig, & Watson, 2001). Vaidya et al. (2001) found the average test–retest correlation for positive affectivity and negative affectivity to be .54, whereas the average test–retest correlation for Extraversion and Neuroticism was .63. However, as mentioned above, with such a relatively long interval the lower test–retest correlations could be due to the lower stability of the constructs, to larger transient error, or to both. It is not clear from the current research evidence which source (instability of the constructs or transient error) is responsible for the low test–retest correlations often observed. Our study attempts to disentangle the effects of these sources by examining only the effect of transient error. We accomplished this by using a short test–retest interval (1 week), during which we can be fairly certain that the constructs in question are stable.

From a theoretical perspective, people may not be able to entirely separate self-assessments of trait affectivity from current affective states (Schmidt, 1999). That is, mood states may, to some extent, influence self-ratings of trait affectivity. This influence will translate into increased levels of transient error in the measurement of trait-level affectivity. We expected the impact of mood states on self-reports of trait affectivity to be largest for the PANAS measures because these scales use mood descriptors to assess affectivity; consequently, the amount of transient error should be largest for the PANAS measures among those examined in the study.

Method

Procedure

To investigate these hypotheses, we conducted a study at a midwestern university. Two hundred thirty-five students enrolled in an introductory management course participated in the study. Participation was voluntary and was rewarded with course credits. The initial sample comprised 123 women (52%) and 112 men (48%). The average age of the participants was 21.5 years.

All participants were requested to complete two laboratory sessions separated by an interval of approximately 1 week. A total of 18 sessions, each having about 20 participants, was arranged. The sessions spanned 3 weeks. Measures were organized into two distinct questionnaires. Each questionnaire included one of a pair of parallel forms of the measures for each construct studied (described below) in this project. In each session, participants were given one questionnaire to answer. Each participant received a different questionnaire in each of his or her two sessions.

Sixty-eight of the initial participants failed to attend the second session, so the final sample consists of 167 participants (83 men and 84 women; average age = 21.3 years). There was no difference between the participants who dropped out and those in the final sample in terms of age and gender. Because it is possible that the dropout participants differed from those in the final sample on some other characteristics that could influence the results (e.g., those who dropped out may have been lower in conscientiousness, which could have resulted in some range restriction on this variable in the final sample), we carried out further analyses to examine the potential bias due to sample attrition. Mean scores of the participants who did not complete the study were compared with those of the final sample on all the available measures. No statistically significant differences were found.

Measures

As mentioned above, we measured constructs from the cognitive domain, the broad personality domain, and the affective traits domain. More specifically, we measured general mental ability (GMA), the Big Five personality traits, self-esteem, generalized self-efficacy, positive affectivity, and negative affectivity.

Table 1
Measures Used in the Study

Construct                                            Measure
Cognitive
  General mental ability                             Wonderlic
Broad personality
  Big Five (Conscientiousness, Extraversion,
    Agreeableness, Neuroticism, and Openness)        PCI,ᵃ IPIP Big Fiveᵃ
  Self-esteem                                        TSBI (Forms A and B), Rosenbergᵃ
  Generalized self-efficacy                          GSE Scaleᵃ
Affective traits
  Positive and negative affectivity                  PANAS,ᵃ DE scale,ᵃ MPIᵃ

Note. Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Item Pool (Goldberg, 1997); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (1965); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); DE = Diener and Emmons's (1984) Affect-Adjective Scale; MPI = Multidimensional Personality Index (Watson & Tellegen, 1985).
ᵃ These measures have no parallel forms. We used the method described in the text to calculate coefficient of equivalence and stability (CES) reliability values.

Table 1 presents the measures used in the study. As shown therein, we used the Wonderlic Personnel Test (Wonderlic Personnel Test, Inc., 1998) to measure GMA. This test is popular among employers as a valid personnel selection tool. It has also been used in research to operationalize the GMA construct (e.g., Barrick, Stewart, Neubert, & Mount, 1998; Hansen, 1989; Mackaman, 1982). Available empirical evidence shows that the test is highly correlated with many other popular tests of GMA, such as the Wechsler Adult Intelligence Scale–Revised (Wechsler, 1981; .75–.96), the Otis–Lennon (Otis & Lennon, 1979; .87–.99), and the General Aptitude Test Battery (U.S. Department of Labor, 1970; .74; Wonderlic Personnel Test, 1998).

For the Big Five personality constructs, we used two inventories: the Personal Characteristics Inventory (PCI; Barrick & Mount, 1995) and the International Personality Item Pool Big Five scale (IPIP; Goldberg, 1997). The respective manuals of these scales show that they are highly correlated with other measures of the Big Five personality constructs popularly used in research and practice, that is, the Neuroticism–Extraversion–Openness Personality Inventory—Revised (NEO-PI-R; Costa & McCrae, 1985), the Hogan Personality Inventory (Hogan, 1986), and Goldberg's markers (Goldberg, 1992). There have also been empirical studies using these scales in the literature (e.g., Barrick, Mount, & Strauss, 1994; Barrick et al., 1998; Heaven & Bucci, 2001).

Two scales were used for the self-esteem construct: the Rosenberg Self-Esteem Scale (Rosenberg, 1965) and the Texas Social Behavior Inventory (Helmreich & Stapp, 1974). According to a recent review (Blascovich & Tomaka, 1991), these scales are among the most popularly used measures of self-esteem in the literature. Because generalized self-efficacy is a relatively new construct, not many scales are available; we chose the scale developed by Sherer et al. (1982), which appears to be one of the most widely used measures of this construct. The PANAS (Watson et al., 1988) is probably the most popular measure of affective traits (Price, 1997), so we used it in our study. To obtain additional estimates of transient error for the affective constructs, we included two other measures of trait affectivity: Diener and Emmons's (1984) Affect-Adjective Scale and the Multidimensional Personality Index (Watson & Tellegen, 1985).

As shown in Table 1, only two of the measures used have parallel forms available: the Wonderlic Personnel Test (Wonderlic Personnel Test, 1998) for the construct of GMA and the Texas Social Behavior Inventory (Helmreich & Stapp, 1974) for the construct of self-esteem. In the case of the other measures, we split them into halves to form parallel half scales. Efforts were made to create half scales as strictly parallel as possible (cf. Becker, 2000; Clause, Mullins, Nee, Pulakos, & Schmitt, 1998; Cronbach, 1943): Items in the measures were allocated to the halves on the basis of (a) item content, (b) their loadings on the respective factors (from previous empirical studies providing this information), and (c) other relevant statistics (i.e., standard deviations, intercorrelations with other items) when available. For each measure of this type, a different half scale was included in the questionnaire administered on each occasion. Because of time constraints and concerns about measurement reactivity, we did not administer the full scales on both occasions and apply post hoc division of the scales, as discussed in the earlier section.

Analyses

Two types of reliability coefficients were computed: The CE was computed with Cronbach's (1951) formula, and the CES was computed using the method detailed earlier. Specifically, for the measures with parallel forms (the Wonderlic Personnel Test and the Texas Social Behavior Inventory), the CE value was estimated by averaging the coefficient alphas of the parallel forms; the CES values were the correlations between the parallel forms administered on different occasions. For the measures without parallel forms, their half-scale coefficients of equivalence (ce) were estimated by averaging the coefficient alphas of the parallel half scales of the same measures; their half-scale coefficients of equivalence and stability (ces) were the correlations between the parallel half scales of the same measures administered on two different occasions. The Spearman–Brown formula was then used to estimate the CE of the full scales from those of the half scales (ce). The CES of the full scales was estimated from those of the half scales (ces) by use of Equation 14 derived in the previous section. Next we estimated the ratio of transient error variance to observed score variance for the measures of each construct by subtracting the CES from the CE; this ratio or proportion is hereafter referred to as TEV. When available, TEVs for measures of the same construct were averaged to obtain the best estimate of the effect of transient error on measures of that construct. Finally, we examined the effect of transient error on measures of different constructs by comparing proportions of TEV across measures and constructs.

As discussed earlier, when sample size is limited and half scales are likely to be randomly parallel instead of strictly parallel, the values of the CES estimated by Equation 14 can be affected by sampling error in both subjects and items. It is therefore important to have information about the distribution of the CES estimates, as well as those of the TEV and CE. Unfortunately, the nonlinearity of Equation 14 and the complexity of the sampling distributions of its components (e.g., ce is estimated by averaging two coefficient alphas of the half scales) render analytical derivation of a formula for estimating the standard error of the CES difficult. Here the Monte Carlo simulation technique provides a tool for empirically examining the distributions of interest (cf. Mooney, 1997). We simulated data based on the results (i.e., estimates of CE, CES, and TEV) obtained from the previous analysis. One thousand data sets were generated for each measure included in the study under the same conditions (see Appendix B for further details of the simulation procedure). For each data set, the same analysis procedure described in the previous section was carried out to estimate the CE, CES, and TEV.

Standard deviations of the distributions of these estimates provide the standard errors of interest.
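The following is a minimal sketch of the logic of this analysis pipeline on simulated data (our illustration; the variance components and the code itself are hypothetical and are not the simulation procedure of Appendix B):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 167  # final sample size in the study

# Hypothetical variance components: construct 1.00, transient 0.09,
# specific factor 0.16, random response 0.16. These imply population
# values of roughly ce = .77, ces = .71, CE = .87, CES = .80, TEV = .07.
trait = rng.normal(0.0, 1.0, n)
spec_a1 = rng.normal(0.0, 0.4, n)  # specific factor error of half a1 (replicates across occasions)
spec_a2 = rng.normal(0.0, 0.4, n)  # specific factor error of half a2
occ1 = rng.normal(0.0, 0.3, n)     # transient state at Time 1, shared by all items answered then
occ2 = rng.normal(0.0, 0.3, n)     # transient state at Time 2

def administer(spec, occ):
    # observed half-scale score = trait + transient + specific factor + random response error
    return trait + occ + spec + rng.normal(0.0, 0.4, n)

a1_t1, a2_t1 = administer(spec_a1, occ1), administer(spec_a2, occ1)
a1_t2, a2_t2 = administer(spec_a1, occ2), administer(spec_a2, occ2)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

ce = (corr(a1_t1, a2_t1) + corr(a1_t2, a2_t2)) / 2   # same occasion (cf. Equation 9)
ces = (corr(a1_t1, a2_t2) + corr(a2_t1, a1_t2)) / 2  # across occasions and halves (cf. Equation 10)

CE = 2 * ce / (1 + ce)    # Spearman-Brown step-up to the full-scale CE
CES = 2 * ces / (1 + ce)  # Equation 14 step-up to the full-scale CES
print(f"CE = {CE:.2f}, CES = {CES:.2f}, TEV = {CE - CES:.2f}")
```

Repeating such a simulation many times and taking the standard deviation of each estimate across replications yields Monte Carlo standard errors of the kind reported in Table 3.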

Results

Table 2 shows the results of the study. The standard deviations given are for full-length scales; where the observed data were for half-length scales, the standard deviations were computed using the methods presented in Appendix A. The next two columns present the CE and CES values, followed by the estimate of the TEV. The last column indicates the percentage by which the CE (coefficient alpha) overestimates the actual reliability.

Overall, the estimates of the CES are smaller than the estimates of the CE (coefficient alpha), indicating that transient error exists in scores obtained with the measures used in this study. For GMA, the CE overestimated the reliability of the measure by 6.70%; for the Big Five personality measures, the extent of the overestimation varied between 0% and 14.90% across the 10 measures (two measures of each of the Big Five constructs). Among the measures of the Big Five traits, the Neuroticism measures contained the largest amount of transient error (TEV = .09 across the two measures), which is consistent with the affective nature of this trait (Watson, 2000).

The estimated value of the TEV is negative for the measures of Openness to Experience and for the PCI measure of Extraversion. Because a proportion of variance cannot be negative by definition, negative values must be attributed to sampling error, so we considered the TEV of these measures to be zero. Hunter and Schmidt (1990) provided a detailed discussion of the problem of negative variance estimates, a common phenomenon in meta-analysis due to second-order sampling error. Negative variance estimates also occur in analysis of variance. When a variance is estimated as the difference between two other variance estimates, the obtained value can be negative because of sampling error in the two estimates. For example, if the population value of the variance in question is zero, the probability that it is estimated to be negative will be about 50% (because about half of its sampling distribution is less than zero). The practice of attributing negative variance estimates to sampling error and treating them as zero is also the rule in generalizability theory research (e.g., Cronbach et al., 1972; Shavelson, Webb, & Rowley, 1989).


Table 2
Estimates of Reliability Coefficients and Proportions of Transient Error Variance

Construct and measure     SDᵃ      CE     CES    TEV            % Overestimateᵇ
GMA
  Wonderlic               5.36     .79    .74    .05             6.7
Conscientiousness
  PCI                    10.09     .88    .78    .10            13.0
  IPIP                   18.08     .93    .89    .04             4.5
  Average                          .91    .84    .07             8.3
Extraversion
  PCI                     8.43     .81    .83    .00 (−.02)ᶜ     0.0
  IPIP                   16.90     .92    .88    .04             4.5
  Average                          .87    .86    .02             2.2
Agreeableness
  PCI                     6.93     .85    .80    .05             6.3
  IPIP                   14.82     .88    .87    .01             1.1
  Average                          .87    .84    .03             3.6
Neuroticism
  PCI                     8.14     .85    .74    .11            14.9
  IPIP                   19.69     .93    .86    .07             8.1
  Average                          .89    .80    .09            11.3
Openness
  PCI                     6.17     .81    .87    .00 (−.05)ᶜ     0.0
  IPIP                   15.16     .88    .90    .00 (−.02)ᶜ     0.0
  Average                          .85    .89    .00 (−.04)ᶜ     0.0
GSE
  Sherer's GSE           28.83     .89    .83    .05             6.0
Self-esteem
  TSBI                    8.31     .85    .80    .05             6.3
  Rosenberg                        .84    .79    .05             6.3
  Average                          .85    .80    .05             6.3
Positive affectivity
  PANAS                   4.96     .82    .63    .19            30.2
  MPI                     7.94     .91    .81    .09            11.1
  DE scale                3.72     .86    .74    .12            16.2
  Average                          .86    .73    .13            17.8
Negative affectivity
  PANAS                   5.78     .83    .78    .05             6.4
  MPI                     9.33     .90    .80    .09            11.2
  DE scale                4.17     .90    .69    .21            30.4
  Average                          .88    .76    .11            14.5

Note. N = 167. Blank cells indicate that data are not applicable. CE = coefficient of equivalence (alpha); CES = coefficient of equivalence and stability; TEV = transient error variance over observed score variance; GMA = general mental ability; Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Item Pool (Goldberg, 1997); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (1965); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); MPI = Multidimensional Personality Index (Watson & Tellegen, 1985); DE = Diener and Emmons's (1984) Affect-Adjective Scale.
ᵃ Full-scale standard deviation. ᵇ Percentage by which the CE overestimates the CES. ᶜ The estimated ratio of transient error variance to observed score variance is negative for these measures. Because a variance proportion cannot be negative, zero is the appropriate estimate of its population value.
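The last column of Table 2 follows directly from the CE and CES columns; as a quick check (our illustration, not code from the article):

```python
def percent_overestimate(ce: float, ces: float) -> float:
    """Percentage by which the CE overestimates the actual reliability (CES)."""
    return 100.0 * (ce - ces) / ces

# GMA row of Table 2: CE = .79, CES = .74
print(round(percent_overestimate(0.79, 0.74), 1))  # 6.8 (tabled as 6.7, computed from unrounded inputs)
```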


self-efficacy overestimated reliability by 6.00%. The overestimation was 6.30% for both measures of self-esteem. Contrary to our expectation, measures of GMA and broad personality traits appear to have comparable TEV (.05 for GMA and an average of .04 across the seven broad personality traits; Table 2). Hence, the comparison of the amount of transient error in measures of GMA with the amounts present in measures of broad personality constructs does not support Schmidt and Hunter's (1996, 1999) hypothesis that transient error affects measures of ability to a lesser extent than measures of broad personality factors.4 The findings indicate that the proportion of transient error variance is quite large for measures of affectivity, averaging (across the three affectivity inventories) .13 for positive affectivity and .11 for negative affectivity. These results show that the CE overestimates the reliability of PA and NA measures, on average, by about 18% and 15%, respectively. Consistent with our prediction, the proportion of transient error in measures of PA is largest for the PANAS measure (.19). However, among measures of NA, the PANAS contained the smallest TEV (.05), contrary to our expectation.

Table 3 shows the standard errors of the CE, CES, and TEV estimates. These values were obtained from the Monte Carlo simulation, except for those of the averaged estimates (the averaged CE, CES, or TEV for constructs with more than one measure), which were calculated by the usual formula for averaged statistics.5 As the table shows, the standard errors of the CE estimates (ranging from .008 for IPIP Conscientiousness to .023 for PCI Openness) are consistently smaller than those of the CES and TEV. Standard errors of the CES (ranging from .022 to .065) are in turn slightly but consistently larger than those of the TEV (from .019 to .062). In general, scales with larger numbers of items have smaller standard errors. This finding is expected, because the estimates are affected by sampling error in both subjects and items.

Of special interest are the standard errors of the TEV estimates. These values are small, but because the TEV estimates are also small, the standard errors are somewhat large relative to the estimates, a consequence of the modest sample size of the study. Nevertheless, the fact that the 90% confidence intervals of the majority of the TEV estimates (12 of the 20 measures included in the study) do not cover zero indicates the existence of transient error in those measures.
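To make the confidence-interval statement concrete, the sketch below forms 90% intervals under the usual normal approximation (estimate ± 1.645 × SE). The article does not spell out how its intervals were constructed, so this is an assumed reconstruction; the two illustrations simply reuse TEV values from Table 3.

```python
def ci90(estimate: float, se: float) -> tuple[float, float]:
    """90% confidence interval under a normal approximation (an assumed
    construction; the article does not state how its intervals were formed)."""
    z = 1.645  # 95th percentile of the standard normal distribution
    return estimate - z * se, estimate + z * se

# TEV for PANAS Positive Affectivity (Table 3): interval excludes zero.
print(ci90(0.19, 0.062))  # -> roughly (0.088, 0.292)

# TEV for IPIP Agreeableness (Table 3): interval covers zero.
print(ci90(0.01, 0.027))  # -> roughly (-0.034, 0.054)
```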

Discussion and Conclusion

This study replicates and extends the findings presented by Becker (2000). The primary implication of these findings is that the nearly universal use of the CE as the reliability estimate for measures of important and widely used psychological constructs, such as those studied in this project, leads to overestimates of scale reliability. As shown in this article, the overestimation occurs because the CE does not capture transient error. As Becker noted, few empirical investigations of the effect of transient error go beyond computer simulations. The present study contributes to the measurement literature by empirically calibrating the effect of transient error on the measurement of important constructs from the individual differences area and by formulating a simple method of computing the CES, the reliability coefficient that takes into account random response, transient, and specific factor error processes.

4 This general conclusion remained essentially unchanged even after adjusting for the fact that, as might be expected, our student sample was less heterogeneous on Wonderlic scores than the general population (SDs = 5.36 and 7.50, respectively, where the 7.50 is from the Wonderlic Test Manual). We adjusted both the CE and the CES using the formula $r_{xxA} = 1 - u_x^2(1 - r_{xxB})$, where $u_x$ = sample SD/population SD, $r_{xxB}$ = the sample reliability (from Table 2), and $r_{xxA}$ = the reliability in the general (more heterogeneous) population (Nunnally & Bernstein, 1994). This adjustment reduced the TEV estimate for the Wonderlic from .05 to .03, which is still not appreciably different from the average figure of .04 for the personality measures. It is to be expected that college students will be somewhat range restricted on GMA; as is usually the case, no such differences in variability were found for the noncognitive measures.

5 The following formula was used to calculate the standard errors of the averaged estimates (CE, CES, or TEV):

$$SE_m = \frac{1}{k}\left(\sum_{i=1}^{k} SE_i^2\right)^{1/2},$$

where $SE_m$ is the standard error of the averaged estimate, $SE_i$ is the standard error of the $i$th individual estimate, and $k$ is the number of estimates.
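In code, the footnote's formula is a one-liner. This minimal Python sketch (ours, not part of the original article) reproduces, for example, the averaged Conscientiousness CE standard error in Table 3 from its two scale-level values.

```python
import math

def se_of_average(se_values: list[float]) -> float:
    """Standard error of the mean of k independent estimates (Footnote 5):
    SE_m = (1/k) * sqrt(sum of SE_i^2)."""
    k = len(se_values)
    return math.sqrt(sum(se ** 2 for se in se_values)) / k

# Conscientiousness CE standard errors from Table 3 (PCI and IPIP):
print(round(se_of_average([0.013, 0.008]), 3))  # -> 0.008, matching the table
```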


Table 3
Standard Errors for the Reliability Coefficients and Transient Error Variance Estimates

Construct and measure     CE    SE(CE)a   CES   SE(CES)b   TEV   SE(TEV)c
GMA
  Wonderlic              .79    .020      .74    .035      .05    .029
Conscientiousness
  PCI                    .88    .013      .78    .042      .10    .039
  IPIP                   .93    .008      .89    .022      .04    .019
  Average                .91    .008d     .84    .024d     .07    .022d
Extraversion
  PCI                    .81    .020      .83    .039      .00    .038
  IPIP                   .92    .009      .88    .025      .04    .022
  Average                .86    .011d     .86    .023d     .02    .022d
Agreeableness
  PCI                    .85    .017      .80    .041      .05    .039
  IPIP                   .88    .014      .87    .030      .01    .027
  Average                .87    .011d     .84    .025d     .03    .024d
Neuroticism
  PCI                    .85    .016      .74    .048      .11    .045
  IPIP                   .93    .008      .86    .027      .07    .025
  Average                .89    .009d     .80    .028d     .09    .026d
Openness
  PCI                    .81    .023      .87    .045      .00    .045
  IPIP                   .88    .013      .90    .028      .00    .027
  Average                .85    .013d     .89    .027d     .00    .026d
GSE
  Sherer's GSE           .89    .013      .83    .035      .05    .032
Self-esteem
  TSBI                   .85    .015      .80    .029      .05    .023
  Rosenberg              .84    .019      .79    .044      .05    .043
  Average                .85    .012d     .80    .026d     .05    .024d
Positive affectivity
  PANAS                  .82    .022      .63    .065      .19    .062
  MPI                    .91    .011      .81    .033      .09    .031
  DE scale               .86    .018      .74    .045      .12    .044
  Average                .86    .010d     .73    .029d     .13    .027d
Negative affectivity
  PANAS                  .83    .021      .78    .047      .05    .046
  MPI                    .90    .011      .80    .036      .09    .033
  DE scale               .90    .012      .69    .049      .21    .046
  Average                .88    .009d     .76    .026d     .11    .024d

Note. All standard errors were obtained from the Monte Carlo simulation. CE = coefficient of equivalence (alpha); CES = coefficient of equivalence and stability; TEV = transient error variance over observed score variance; GMA = general mental ability; Wonderlic = Wonderlic Personnel Test (1998); PCI = Personal Characteristics Inventory (Barrick & Mount, 1995); IPIP = International Personality Item Pool (Goldberg, 1997); GSE = Sherer's Generalized Self-Efficacy Scale (Sherer et al., 1982); TSBI = Texas Social Behavior Inventory (Helmreich & Stapp, 1974); Rosenberg = Rosenberg Self-Esteem Scale (Rosenberg, 1965); PANAS = Positive and Negative Affect Schedule (Watson, Clark, & Tellegen, 1988); MPI = Multidimensional Personality Index (Watson & Tellegen, 1985); DE = Diener and Emmons's (1984) Affect-Adjective Scale.
a Standard error of the CE estimates. b Standard error of the CES estimates. c Standard error of the TEV estimates. d Standard error of an averaged estimate.


With the exception of the personality trait of Openness to Experience, transient error affected the measurement of all constructs measured in this study.6 Results showed that the proportion of transient error in measures of GMA is comparable to the transient error proportion in measures of broad personality traits, which was inconsistent with our expectation based on Schmidt and Hunter's (1996) speculation that transient error is smaller in magnitude in the cognitive domain than in the noncognitive domain. These findings are, however, consistent with the arguments presented by Conley (1984). It may be that, over the long term, personality indeed changes more than intelligence does, causing decreased test–retest stability over long periods for personality constructs, whereas in the short term transient error affects cognitive and noncognitive measures similarly. Future research might examine this hypothesis.

As expected, the overestimation of scale reliability was substantial in the affectivity domain and the largest among the construct domains sampled in this study. For positive affectivity, we found that the CE overestimates the reliability of the measures examined here by about 18% on average; for the PANAS, the most popular measure of affectivity, the overestimate is higher, at about 30%. For negative affectivity the amount of TEV was also substantial, leading to an overestimation of the reliability of NA scores by about 15%. The bias resulting from using the CE as the reliability estimate for measures of GMA and broad personality factors is relatively smaller, but it can be potentially consequential. Our estimate of the proportion of TEV in the measures of self-esteem (.05 across the two measures; Table 2) is very similar in magnitude to the .04 estimate obtained by Becker (2000).

As noted earlier, we tried to analyze widely used measures of the constructs of cognitive ability, personality, and affectivity. Nevertheless, it is obviously impossible to include all the popular measures of these constructs. To generalize the findings confidently, more studies similar to this one are needed, using different measures of these constructs as well as measures of other important constructs in psychology and social science.

Just as with other statistical estimates, the estimates of TEV obtained here are susceptible to sampling error. As discussed earlier, sampling fluctuation obviously accounts for the negative estimates of transient error for the measures of Openness to Experience.

Results of our simulation showed that standard errors for the estimates were relatively large, as expected given the modest sample size of the current study. Consequently, the specific figures obtained here should be interpreted cautiously, pending future replications. The problem of sampling error can be adequately addressed only by application of meta-analysis when sufficient studies become available (Hunter & Schmidt, 1990). Accordingly, we encourage further studies examining the effects of transient error.

Recent research (e.g., Becker, 2000; Schmidt & Hunter, 1999) suggested that transient error has a potentially large impact on the measurement of psychological constructs. This potential impact should be taken into account when estimating the reliability of measures and when correcting observed relationships for biases introduced by measurement error (DeShon, 1998; Schmidt & Hunter, 1999). Our study provides some initial empirical estimates of the magnitude of the problem for measures of widely used constructs. The finding that transient error indeed has nontrivial effects calls for a more comprehensive treatment of measurement error in empirical research.

One obvious step toward minimizing transient error is to administer measures repeatedly to the same group of subjects at appropriate time intervals and then average (or sum) the scores across all testing occasions. TEV in the scores thus obtained will be reduced in proportion to the number of testing occasions (Feldt & Brennan, 1989). However, a large number of testing occasions may be needed to virtually eradicate the effect of transient error, and most research projects probably lack the resources to do this. We therefore need a more general approach to correcting for the biasing effects of measurement error in general and transient error in particular. Studies like this one, specifically designed to estimate the CES of measures of important psychological constructs, can provide the needed reliability estimates, to be subsequently used to correct for measurement error in observed correlations between measures.7 The cumulative development of such a "database" of CES estimates for widely used psychological measures would enable substantive researchers to obtain unbiased estimates of construct-level relationships among their variables of interest while minimizing the research resources needed. Rothstein (1990) presented a precedent for this general approach: She provided large-sample, meta-analytically derived figures for the interrater reliability of supervisory ratings of overall job performance that have subsequently been used to correct ratings for measurement error in many published studies. A similar CES database could serve the same purpose for widely used measures of individual differences constructs.

6 Our estimates of TEV (and those of Becker, 2000) are actually slight underestimates. Strictly speaking, coefficient alpha estimates the random CE rather than the strict parallelism CE (Cronbach, 1951; Nunnally & Bernstein, 1994). Hence coefficient alpha can be expected to be slightly smaller than the strictly parallel CE, and therefore the quantity (alpha − CES) slightly underestimates TEV. This difference, however, is small and is reduced somewhat by the fact that the measures used in computing the CES may have fallen somewhat short of strict parallelism.

7 As is well known, reliability varies with sample heterogeneity: Reliability is higher in samples (and populations) in which the scale standard deviation (SDx) is larger. However, given the reliability and the SDx in one sample, a simple formula (see Footnote 4) is available for computing the corresponding reliability in another sample with a different SDx (Nunnally & Bernstein, 1994).
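As an illustration of the formula referenced in Footnotes 4 and 7, the sketch below (our Python rendering, not the authors' code) projects a reliability estimate into a population with a different standard deviation and reproduces the Wonderlic adjustment reported in Footnote 4.

```python
def adjust_reliability(r_sample: float, sd_sample: float, sd_pop: float) -> float:
    """Reliability in a population with different heterogeneity (Footnote 4):
    r_pop = 1 - u^2 * (1 - r_sample), where u = sd_sample / sd_pop."""
    u = sd_sample / sd_pop
    return 1 - u ** 2 * (1 - r_sample)

# Wonderlic values from Footnote 4: sample SD = 5.36, population SD = 7.50.
ce_adj = adjust_reliability(0.79, 5.36, 7.50)   # about .89
ces_adj = adjust_reliability(0.74, 5.36, 7.50)  # about .87
print(round(ce_adj - ces_adj, 2))  # TEV falls from .05 to about .03
```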

References

Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Barrick, M. R., & Mount, M. K. (1995). The Personal Characteristics Inventory manual. Unpublished manuscript, University of Iowa, Iowa City.
Barrick, M. R., Mount, M. K., & Strauss, J. P. (1994). Antecedents of involuntary turnover due to a reduction in force. Personnel Psychology, 47, 515–535.
Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality to work-team processes and team effectiveness. Journal of Applied Psychology, 83, 377–391.
Becker, G. (2000). How important is transient error in estimating reliability? Going beyond simulation studies. Psychological Methods, 5, 370–379.
Blascovich, J., & Tomaka, J. (1991). Measures of self-esteem. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 115–160). San Diego, CA: Academic Press.
Brief, A. P. (1998). Attitudes in and around organizations. Thousand Oaks, CA: Sage.
Buss, A. H., & Perry, M. (1992). The Aggression Questionnaire. Journal of Personality and Social Psychology, 63, 452–459.
Clause, C. S., Mullins, M. E., Nee, M. T., Pulakos, E., & Schmitt, N. (1998). Parallel test form development: A procedure for alternate predictors and an example. Personnel Psychology, 51, 193–208.
Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality, and self-opinion. Personality and Individual Differences, 5, 11–25.
Costa, P. T., & McCrae, R. R. (1985). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.
Cronbach, L. J. (1943). On estimates of test reliability. Journal of Educational Psychology, 34, 485–494.
Cronbach, L. J. (1947). Test reliability: Its meaning and determination. Psychometrika, 12, 1–16.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
DeShon, R. P. (1998). A cautionary note on measurement error corrections in structural equation models. Psychological Methods, 3, 412–423.
Diener, E., & Emmons, R. A. (1984). The independence of positive and negative affect. Journal of Personality and Social Psychology, 47, 1105–1117.
Feldt, L. S. (1965). The approximate sampling distribution of Kuder–Richardson reliability coefficient twenty. Psychometrika, 30, 357–370.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.
Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4, 26–42.
Goldberg, L. R. (1997). A broad-bandwidth, public-domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg, the Netherlands: Tilburg University Press.
Hansen, C. P. (1989). A causal model of the relationship among accidents, biodata, personality, and cognitive factors. Journal of Applied Psychology, 74, 81–90.
Heaven, P. C. L., & Bucci, S. (2001). Right-wing authoritarianism, social dominance orientation and personality: An analysis using the IPIP measure. European Journal of Personality, 15, 49–56.
Helmreich, R., & Stapp, J. (1974). Short forms of the Texas Social Behavior Inventory (TSBI): An objective measure of self-esteem. Bulletin of the Psychonomic Society, 4, 473–475.
Hogan, R. (1986). Hogan Personality Inventory manual. Minneapolis, MN: National Computer Systems.
Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.
Judge, T. A., & Bretz, R. D. (1993). Report on an alternative measure of affective disposition. Educational and Psychological Measurement, 53, 1095–1104.
Le, H., & Schmidt, F. L. (2001, August). The multi-faceted nature of measurement error and its implications for measurement error corrections. Paper presented at the 109th Annual Convention of the American Psychological Association, San Francisco, CA.
Lord, F. M. (1955). Sampling fluctuations resulting from the sampling of test items. Psychometrika, 20, 1–22.
Mackaman, S. L. (1982). Performance evaluation tests for environmental research: Wonderlic Personnel Test. Psychological Reports, 51, 635–644.
Mooney, C. Z. (1997). Monte Carlo simulation. Newbury Park, CA: Sage.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Otis, A. S., & Lennon, R. T. (1979). Otis–Lennon School Ability Test. New York: The Psychological Corporation.
Price, J. L. (1997). Handbook of organizational measurement. International Journal of Manpower, 18, 303–558.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322–327.
Schmidt, F. L. (1999, March 25). Measurement error and cumulative knowledge in psychological research. Invited address presented at Purdue University, West Lafayette, IN.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223.
Schmidt, F. L., & Hunter, J. E. (1999). Theory testing and measurement error. Intelligence, 27, 183–198.
Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44, 922–932.
Sherer, M., Maddux, J. E., Mercandante, B., Prentice-Dunn, S., Jacobs, B., & Rogers, R. W. (1982). The Self-Efficacy Scale. Psychological Reports, 76, 707–710.
Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement (pp. 356–442). Washington, DC: American Council on Education.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560–620). Washington, DC: American Council on Education.
Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage.
U.S. Department of Labor. (1970). Manual for the USES General Aptitude Test Battery, Section III: Development. Washington, DC: Manpower Administration, U.S. Department of Labor.
Vaidya, J. G., Gray, E. K., Haig, J., & Watson, D. (2001). On the differential stability of traits: Comparing trait affect and Big Five scales. Manuscript submitted for publication.
Watson, D. (2000). Mood and temperament. New York: Guilford Press.
Watson, D., & Clark, L. A. (1994). The PANAS-X: Manual for the Positive and Negative Affect Schedule—expanded form. Iowa City: University of Iowa.
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54, 1063–1070.
Watson, D., & Tellegen, A. (1985). Toward a consensual structure of mood. Psychological Bulletin, 98, 219–235.
Wechsler, D. (1981). Manual for the Wechsler Adult Intelligence Scale—Revised. New York: The Psychological Corporation.
Wonderlic Personnel Test. (1998). Wonderlic Personnel Test & Scholastic Level Exam user's manual. Northfield, IL: Wonderlic and Associates.


Appendix A

Estimating the Standard Deviation of a Full Scale From Its Subscale Standard Deviation

We need a general method that enables us to estimate the standard deviation of a full scale from values of subscales with any number of items. We derive such a method here. Our derivations assume that all the values come from an unlimited sample size, that is, they are population parameters. We further adopt the item domain model (Nunnally & Bernstein, 1994), which assumes that all items of a scale are randomly sampled from a hypothetical pool of an infinite number of items.

Consider Scale A1 with $k_1$ items. Assuming that we have the standard deviation of A1 ($\sigma_{A1}$) and its coefficient alpha ($\alpha_1$), we need to derive a formula to estimate the standard deviation of a Scale A2 that has $k_2$ items from the same item domain. Following the item domain model, the items of Scales A1 and A2 are randomly sampled from the same hypothetical item domain. The variance of A1 can be written as a function of the average variance and intercorrelation of a single item as follows:

$$\sigma_{A1}^2 = k_1\sigma^2 + k_1(k_1 - 1)\sigma^2\rho, \tag{A1}$$

where $\sigma$ is the average standard deviation of one item and $\rho$ is the average intercorrelation among all the items included in the scale. Similarly, the variance of A2 can be obtained from the average variance and intercorrelation of a single item:

$$\sigma_{A2}^2 = k_2\sigma^2 + k_2(k_2 - 1)\sigma^2\rho. \tag{A2}$$

Factoring Equations A1 and A2 above for $k_1\sigma^2$ and $k_2\sigma^2$, respectively, we obtain

$$\sigma_{A1}^2 = k_1\sigma^2[1 + (k_1 - 1)\rho] \tag{A3}$$

and

$$\sigma_{A2}^2 = k_2\sigma^2[1 + (k_2 - 1)\rho]. \tag{A4}$$

Solving Equation A3 for $\sigma^2$, we have

$$\sigma^2 = \sigma_{A1}^2 / [k_1 + k_1(k_1 - 1)\rho]. \tag{A5}$$

The average intercorrelation among all items ($\rho$) can be computed from the coefficient alpha of Scale A1 ($\alpha_1$) by the reversed Spearman–Brown formula:

$$\rho = \alpha_1 / [k_1 - (k_1 - 1)\alpha_1]. \tag{A6}$$

Substituting the value for $\rho$ in Equation A6 into Equation A5 and simplifying the result, we have

$$\sigma^2 = \sigma_{A1}^2 [k_1 - (k_1 - 1)\alpha_1] / k_1^2. \tag{A7}$$

Now replacing the value of $\rho$ in Equation A6 into Equation A4 and simplifying, we obtain Equation A8:

$$\sigma_{A2}^2 = k_2\sigma^2 [k_1 - \alpha_1(k_1 - k_2)] / [k_1 - (k_1 - 1)\alpha_1]. \tag{A8}$$

Next, replacing the value of $\sigma^2$ in Equation A7 into Equation A8 above and simplifying, we have the equation of interest:

$$\sigma_{A2}^2 = \frac{k_2}{k_1^2}\,\sigma_{A1}^2\,[k_1 - (k_1 - k_2)\alpha_1]. \tag{A9}$$

Taking the square root of both sides of Equation A9 yields the formula used to compute the standard deviation of a scale having $k_2$ items from the standard deviation and coefficient alpha of a subscale having $k_1$ items:

$$\sigma_{A2} = \frac{\sigma_{A1}}{k_1}\,\{k_2 [k_1 - (k_1 - k_2)\alpha_1]\}^{1/2}. \tag{A10}$$

As can be seen from the derivations above, this formula can be used to compute the standard deviation of one scale from the standard deviation and coefficient alpha of any other scale if the two scales sample items from the same item domain. In practice, the values of $\sigma_{A1}$ and $\alpha_1$ are estimated and therefore subject to sampling error; consequently, the estimate of $\sigma_{A2}$ is also subject to sampling error. Using the circumflex to denote estimated values, we can rewrite Equation A10 as follows:

$$\hat{\sigma}_{A2} = \frac{\hat{\sigma}_{A1}}{k_1}\,\{k_2 [k_1 - (k_1 - k_2)\hat{\alpha}_1]\}^{1/2}. \tag{A10'}$$

To reduce the impact of sampling error, we should average the values across subscales. Equation A10′ was used in our study to estimate the standard deviations of the full scales of PCI Extraversion, PCI Openness, Sherer's GSE, and MPI Positive Affectivity and Negative Affectivity.

If we call $p_1$ the ratio of the number of items in the subscale A1 to that of the full scale A2 (i.e., $p_1 = k_1/k_2$), Equation A9 above can be alternatively written as follows:

$$\sigma_{A2}^2 = \sigma_{A1}^2 [p_1 - (p_1 - 1)\alpha_1] / p_1^2. \tag{A11}$$

Equation A11 above is the same as Equations 18a and 18b in the text.
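Equation A10′ is straightforward to apply in code. The following Python sketch (ours, with purely hypothetical input values) implements it directly.

```python
def full_scale_sd(sd_sub: float, alpha_sub: float, k1: int, k2: int) -> float:
    """Equation A10': SD of a k2-item full scale estimated from the SD and
    coefficient alpha of a k1-item subscale from the same item domain."""
    return sd_sub / k1 * (k2 * (k1 - (k1 - k2) * alpha_sub)) ** 0.5

# Hypothetical example: a 10-item subscale with SD = 6.0 and alpha = .85
# implies an SD of about 11.5 for the 20-item full scale.
print(round(full_scale_sd(6.0, 0.85, k1=10, k2=20), 1))
```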


Appendix B

Simulation Procedure to Estimate Standard Errors of CE, CES, and TEV

Data were simulated for each subject on each test item. A subject's score on an item of a test includes four components:

$$x_{pij} = t_{pi} + s_{pi} + o_{pj} + e_{pij}, \tag{B1}$$

where $x_{pij}$ is the observed score of subject p on item i on occasion j, $t_{pi}$ is the true score of subject p on item i, $s_{pi}$ is the specific factor error of item i (the interaction between subject p and item i), $o_{pj}$ is the transient error on occasion j (the interaction between subject p and occasion j), and $e_{pij}$ is the random response error of subject p on item i on occasion j.

The components on the right side of Equation B1 can be determined on the basis of information (reliability coefficients) about the scale that includes item i. Specifically, consider a subject's score on a scale X that has k items; it is the sum of his or her scores across the k items. From Equation B1 above, we have

$$X_{pj} = \sum_{i=1}^{k} (t_{pi} + s_{pi} + o_{pj} + e_{pij}). \tag{B2}$$

The variance (across subjects) of scale X is then

$$\mathrm{Var}(X) = \mathrm{Var}\!\left(\sum_{i=1}^{k} t_{pi}\right) + \mathrm{Var}\!\left(\sum_{i=1}^{k} s_{pi}\right) + \mathrm{Var}\!\left(\sum_{i=1}^{k} o_{pj}\right) + \mathrm{Var}\!\left(\sum_{i=1}^{k} e_{pij}\right). \tag{B3}$$

Because $s_{pi}$ and $e_{pij}$ are not correlated across items (i.e., Cov(s_i, s_i′) = 0 and Cov(e_ij, e_i′j) = 0 for all i ≠ i′), Equation B3 can be expanded as

$$\mathrm{Var}(X) = k^2\mathrm{Var}(t) + k\mathrm{Var}(s) + k^2\mathrm{Var}(o) + k\mathrm{Var}(e), \tag{B4}$$

where Var(t) is the true score variance of an item, Var(s) is the variance of specific factor error of an item, Var(o) is the variance of transient error of an item, and Var(e) is the variance of random response error of an item. By definition, the components on the right side of Equation B4 are

true score variance (TV) = k²Var(t), (B5)
specific factor error variance (SEV) = kVar(s), (B6)
transient error variance (TEV) = k²Var(o), (B7)

and

random response error variance (REV) = kVar(e). (B8)

Standardizing X (i.e., letting Var(X) = 1), we have

TV = CES, (B9)
TEV = CE − CES, (B10)

and

SEV + REV = 1 − CE, (B11)

where CES is the coefficient of equivalence and stability of scale X and CE is the coefficient of equivalence of scale X. From Equations B5 to B11, the variance components for an item can be determined from the reliability coefficients (CE and CES) of the scale in which it is included:

Var(t) = CES/k², (B12)
Var(o) = (CE − CES)/k², (B13)
Var(s) = SEV/k, (B14)

and

Var(e) = (1 − CE − SEV)/k. (B15)

The value of SEV in Equation B14 above can be any positive number smaller than 1 − CE (we tested different values of SEV within that range and obtained virtually the same results). The reliability coefficients (CE and CES) and the number of scale items (k) are available for each measure included in the study (Table 2).

With the values of Var(t), Var(o), Var(s), and Var(e) estimated from Equations B12 to B15 above, we can simulate a score for each subject on each item on each occasion. Specifically, the following equation was used:

$$x_{pij} = z_1[\mathrm{Var}(t)]^{1/2} + z_2[\mathrm{Var}(s)]^{1/2} + z_3[\mathrm{Var}(o)]^{1/2} + z_4[\mathrm{Var}(e)]^{1/2}, \tag{B16}$$

where $z_1$, $z_2$, $z_3$, and $z_4$ were randomly drawn from four independent standard normal distributions. In total, scores for p subjects (p = 167) on k items (k = number of items of the scale examined) on two occasions were simulated. For each subject, the same value of $z_1$ was used across all items and occasions; the same value of $z_3$ was used across items within each occasion (but varied across occasions); and, for each item, the same value of $z_2$ was used across occasions.

Two parallel half scales were then created from the simulated data, and the relevant statistics (coefficient alphas and the correlation between the half scales) were computed. The program used these data to calculate CE, CES, and TEV for each scale included in the study, following the method described in the text. One thousand data sets were simulated, and the standard deviations of the distributions of the estimated CE, CES, and TEV were recorded; these are the standard errors of interest. A computer program (in SAS 8.01) was written to implement the simulation procedures described above. The program is available from Frank L. Schmidt upon request.
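The authors' program was written in SAS 8.01; as a rough illustration only, here is a Python sketch of the same simulation logic. The data generation follows Equations B12 to B16 directly. The CES estimator, however, is our reconstruction rather than the authors' exact Equation 11: under the model, covariances between different half scales on different occasions are free of specific-factor and transient components, so scaling them to the full test recovers the CES. Function names, the choice k = 10, and the SEV value are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2003)

def cronbach_alpha(items):
    """Coefficient alpha (CE) for an items matrix of shape (subjects, k)."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

def simulate_dataset(ce, ces, n=167, k=10, sev=0.05):
    """Scores x[p, i, j] for two occasions, generated per Equation B16."""
    var_t = ces / k ** 2          # Equation B12
    var_o = (ce - ces) / k ** 2   # Equation B13
    var_s = sev / k               # Equation B14 (any 0 < SEV < 1 - CE)
    var_e = (1 - ce - sev) / k    # Equation B15
    z1 = rng.standard_normal((n, 1, 1))  # true score: fixed over items and occasions
    z2 = rng.standard_normal((n, k, 1))  # specific factor: fixed over occasions
    z3 = rng.standard_normal((n, 1, 2))  # transient: varies only by occasion
    z4 = rng.standard_normal((n, k, 2))  # random response error
    return (np.sqrt(var_t) * z1 + np.sqrt(var_s) * z2
            + np.sqrt(var_o) * z3 + np.sqrt(var_e) * z4)

def estimate_once(ce, ces, n=167, k=10):
    """CE, CES, and TEV estimates from one simulated data set."""
    x = simulate_dataset(ce, ces, n, k)
    half = k // 2
    a = x[:, :half, :].sum(axis=1)   # half-scale A totals, shape (n, 2)
    b = x[:, half:, :].sum(axis=1)   # half-scale B totals, shape (n, 2)
    total = a + b
    cov = lambda u, v: np.cov(u, v)[0, 1]
    # Cross-half, cross-occasion covariances carry only true-score variance,
    # so scaling them up to the full test estimates the CES (our reconstruction).
    ces_hat = (2 * (cov(a[:, 0], b[:, 1]) + cov(b[:, 0], a[:, 1]))
               / np.sqrt(total[:, 0].var(ddof=1) * total[:, 1].var(ddof=1)))
    ce_hat = np.mean([cronbach_alpha(x[:, :, j]) for j in (0, 1)])
    return ce_hat, ces_hat, ce_hat - ces_hat  # TEV = CE - CES (Equation B10)

# Standard errors = SDs of the estimates across 1,000 simulated data sets,
# here using the PANAS Positive Affectivity parameters from Table 3.
reps = np.array([estimate_once(ce=0.82, ces=0.63) for _ in range(1000)])
print("SE(CE), SE(CES), SE(TEV):", reps.std(axis=0, ddof=1).round(3))
```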

Received September 5, 2001
Revision received October 7, 2002
Accepted January 13, 2003