INTERNATIONAL JOURNAL OF TESTING, 5(3), 279–300 Copyright © 2005, Lawrence Erlbaum Associates, Inc.

Establishing Measurement Equivalence and Invariance in Longitudinal Data With Item Response Theory

Adam W. Meade
Department of Psychology
North Carolina State University

Gary J. Lautenschlager and Janet E. Hecht
Department of Psychology
University of Georgia

If measurement invariance does not hold over 2 or more measurement occasions, differences in observed scores are not directly interpretable. Golembiewski, Billingsley, and Yeager (1976) identified 2 types of psychometric differences over time as beta change and gamma change. Gamma change is a fundamental change in thinking about the nature of a construct over time. Beta change can be described as respondents’ change in calibration of the response scale over time. Recently, researchers have had considerable success establishing measurement invariance using confirmatory factor analytic (CFA) techniques. However, the use of item response theory (IRT) techniques for assessing item parameter drift can provide additional useful information regarding the psychometric equivalence of a measure over time that is not attainable with traditional CFA techniques. This article marries the terminology commonly used in CFA and IRT techniques and illustrates real advantages for identifying beta change over time with IRT methods rather than typical CFA methods, utilizing a longitudinal assessment of job satisfaction as an example.

Key words: beta change, item response theory, measurement invariance, measurement equivalence, parameter drift

Correspondence should be addressed to Adam W. Meade, Department of Psychology, Campus Box 7650, North Carolina State University, Raleigh, NC 27695–7650. E-mail: [email protected]

Researchers have long been interested in measuring changes in traits and attitudes over time. For years, researchers have assumed that to appropriately measure


change in a variable over time, one must simply administer a measure of that variable at two points in time and compute the difference between the two observed scores. However, in the past decades, many potentially problematic issues in the use of difference scores have been identified (e.g., Cronbach, 1992; Cronbach & Furby, 1970; Edwards, 1994, 2001; although see Zumbo, 1999, for a somewhat more balanced review). To adequately assess change over time when comparing two observed scores, it is essential that a measure is perceived and used in the same way by individuals at the two points in time. In other words, it is necessary to show that the two measurement occasions are psychometrically equivalent (or invariant) to validly assess change over time using difference scores (Schmitt, 1982). Similarly, to make veridical comparisons of test takers over time, the psychometric properties of a test must not change over time. Changes in the psychometric properties of a test over time could change the reliability or predictive validity of a test (e.g., Alvares & Hulin, 1972; Henry & Hulin, 1987).

Although multiple methods exist for establishing measurement equivalence/invariance (ME/I; Vandenberg, 2002), they have evolved in isolation using different terminology to represent identical concepts. These methods center on two methodological paradigms: confirmatory factor analytic (CFA) techniques and item response theory (IRT) methods. In this article, we explicitly link the terminology used in both paradigms, contrast the differing information obtained with these methods, and illustrate the IRT techniques with a longitudinal measure of job satisfaction. Although the methodology utilized in this study is not new, we believe that we are the first to point out distinct advantages of IRT methods over CFA techniques for establishing measurement invariance specifically for longitudinal measures.

LONGITUDINAL MEASUREMENT INVARIANCE

Assessment of change is crucial in many lines of research. It is imperative for researchers and practitioners in educational and employment testing, program evaluation, training, human resources, and organizational development to be able to properly determine if there truly is change over time with respect to a variable of interest. In the past, researchers simply calculated the difference in observed scores at two points in time to determine if a trait had changed. However, Golembiewski, Billingsley, and Yeager (1976) identified three possible types of change (alpha, beta, and gamma change) that could result in differences in observed scores over time.

Beta change is defined as change that results from respondents’ recalibration of the measurement scale over time. Beta change may be most appropriately conceptualized as a lengthening or shortening of the intervals between scale points by respondents over time even though the conceptualization of the construct itself


remains constant. If beta change has occurred, a respondent with the same degree of a latent trait may choose different item response categories on different measurement occasions because of his or her recalibration of the response scale. Experiences between Time 1 and Time 2 can alter how the rating scale is interpreted and thus change how the respondents perceive a difference between the intervals underlying the scale. For example, a rating of 4 (on a 5-point scale) at Time 2 may be interpreted by the respondent in the same way as was a rating of 3 at Time 1 (Vandenberg & Self, 1993). Ample evidence exists in the social judgment literature to suggest the existence and importance of context effects on judgment (Parducci, 1968). Furthermore, previous research (Upshaw, 1969) identified ways in which an individual’s personal reference scale can be altered in judgment units over the course of experience.

Gamma change can be conceptualized as a fundamental change over time in respondents’ understanding and definition of the latent variable being assessed. In other words, if gamma change has occurred, the meaning of the construct has changed for the respondents over time. For example, respondents’ understanding of what constitutes diversity within an organization may change after undergoing diversity awareness training. Finally, alpha change is defined as a true change in an underlying latent variable. To assess alpha change using observed scores, a scale must show ME/I over time. In other words, alpha change can be inferred from observed scores only in the absence of beta and gamma change.

In the decades since Golembiewski et al.’s (1976) triumvirate conceptualization of longitudinal change, there has been some debate over the best way to assess beta and gamma change. Recently, however, researchers have primarily relied on CFA methods to assess beta and gamma change for polytomous (e.g., Likert scale) data. These methods of testing for ME/I over time are similar to the methods used in testing for measurement invariance among different groups of participants (Taris, Bok, & Meijer, 1998). In CFA studies of alpha, beta, and gamma change, several nested model comparisons are conducted. Using these methods, gamma change is indicated when different numbers of factors or different factor patterns are found between two measurement occasions. If gamma change is not present, beta change can be identified by differences in factor loadings across measurement occasions or by differences in factor variances across measurement occasions (Schaubroeck & Green, 1989; Schmitt, 1982; Taris et al., 1998; Vandenberg & Self, 1993). Although differences in factor variances typically are not considered crucial for establishing ME/I in multigroup comparisons, they traditionally have been considered essential for establishing beta change in the longitudinal ME/I literature. Presumably this is because differences in use of the latent continuum by the same individuals over time represent a change in the latent metric individuals use when responding to items.1

1Note that we do not argue that a test of factor variances should constitute a test of beta change, only that historically, this has been the case.


Tests of gamma change and beta change are sequential such that gamma change must be ruled out before tests of beta change are performed.

IRT FRAMEWORK

Although many researchers have found CFA techniques useful for assessing gamma and beta change, additional information regarding the properties of a measure can be determined using IRT methods (Maurer, Raju, & Collins, 1998; Raju, Laffitte, & Byrne, 2002). IRT methods utilize a nonlinear, monotonic model to relate some latent trait (θ) to the probability of a specific item response. With IRT, each item is described by a set of parameters that can be used to graphically depict the relation between an item and a latent trait through use of an item characteristic curve (ICC; see Figure 1). The a parameter is proportional to the slope of the relationship between a latent trait, θ, and the conditional probability of response given the level of a latent trait, P(θ). This parameter reflects the degree to which the item discriminates between persons with different levels of θ.

In addition to a parameters, IRT analyses also utilize b parameters to model the relation between the latent trait and probability of response. For dichotomous items in which each item is simply scored 0 or 1 (typically corresponding to incorrect and correct item scores on an ability test), the b parameter represents item difficulty. For dichotomous items, the b parameters represent the point on the θ scale at which there is a 50% probability that an examinee with that corresponding θ level will get the item correct (for the Rasch and two parameter models).

FIGURE 1 Item characteristic curve for dichotomous items.
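To make the roles of the a and b parameters concrete, the sketch below plots two-parameter logistic ICCs of the kind shown in Figure 1. This is a minimal illustration only, assuming the common logistic form without the 1.7 scaling constant; the parameter values are invented for display and are not estimates from this article.

```python
# Minimal sketch of two-parameter logistic (2PL) item characteristic curves.
# The parameter values below are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt

def icc_2pl(theta, a, b):
    """P(keyed response | theta) for discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 200)
for a, b in [(1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]:
    plt.plot(theta, icc_2pl(theta, a, b), label=f"a = {a}, b = {b}")

plt.axhline(0.5, linestyle="--", linewidth=0.5)  # each curve passes P = .50 at theta = b
plt.xlabel("theta")
plt.ylabel("P(theta)")
plt.legend()
plt.show()
```

Larger a values produce steeper curves (greater discrimination), and larger b values shift the curve to the right (greater difficulty), which is the graphical behavior described in the text.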


Graphically, the b parameter determines how far left or right on the θ scale the ICC is positioned.

Although originally developed to enhance understanding of dichotomously scored items, IRT techniques have been adapted to the context of polytomous items (e.g., Likert scales) with the graded response model (GRM; Samejima, 1969). Under the GRM, there are one fewer b parameters than there are item response categories. These b parameters represent category thresholds that define the horizontal location of the operating characteristic curves (OCCs) that reflect the point on the θ scale at which there is a 50% probability that a respondent with a corresponding θ level will assign a rating at or above a given category on the observed scale (Embretson & Reise, 2000; see Figure 2 for an example of OCCs for an item with five response options). This information is clearly useful in examining beta change over time because it allows a direct mapping of the probability of responding at or above a given category on the observed scale to the underlying trait of the respondent. Note that other IRT models are available for estimating the latent trait to observed response relation (e.g., Muraki’s [1990] Rating Scale model). However, the GRM seems to best model what can be thought of as an operationalization of beta change—a change in the distance between response option boundaries on the latent continuum.

FIGURE 2 Operating characteristic curves for polytomous (Likert-type) items.
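A small sketch of the graded response model may help connect these b thresholds to the OCCs in Figure 2. It assumes the logistic form of Samejima's model with a single a parameter and one fewer threshold parameters than response options; the parameter values are invented for illustration.

```python
# Minimal sketch of Samejima's graded response model (logistic form).
# Invented parameters; not estimates from this article.
import numpy as np

def grm_boundaries(theta, a, bs):
    """P(X >= category c | theta) for c = 2..k; these are the OCCs shown in Figure 2."""
    theta = np.asarray(theta, dtype=float)
    return np.array([1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs])

def grm_category_probs(theta, a, bs):
    """Probability of choosing each of the k categories, obtained by differencing the OCCs."""
    stars = grm_boundaries(theta, a, bs)        # shape: (k - 1, n_theta)
    n = np.atleast_1d(theta).shape[0]
    upper = np.vstack([np.ones(n), stars])      # P(X >= lowest category) = 1
    lower = np.vstack([stars, np.zeros(n)])     # P(X >= beyond highest category) = 0
    return upper - lower                        # shape: (k, n_theta)

# Example: a 5-point Likert item with a = 1.2 and four thresholds b1..b4
probs = grm_category_probs([0.0], a=1.2, bs=[-1.5, -0.5, 0.5, 1.5])
print(probs.ravel(), probs.sum())               # the k category probabilities sum to 1
```

Because each OCC crosses .50 at its own b threshold, lengthening or shortening the spacing of the b values directly changes how the latent continuum maps onto the observed response categories, which is the operationalization of beta change discussed above.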


For some time, researchers of ability testing have successfully used IRT to identify items that function differently in groups of people (e.g., minorities, Camilli & Shepard, 1994; gender, Reise, Smith, & Furr, 2001; raters, Maurer et al., 1998) and over time (e.g., K. Y. Chan, Drasgow, & Sawin, 1999). Whereas assessment of ME/I over time has typically been referred to in the alpha, beta, and gamma change context in CFA tests, tests of ME/I in IRT contexts have been more commonly referred to as tests of item parameter drift (Bock, Muraki, & Pfeiffenberger, 1988; Goldstein, 1983). These IRT techniques for detecting item parameter drift involve tests of differential item functioning (DIF), which are conceptually comparable to CFA methods of establishing measurement invariance. Typically, these IRT techniques compare the ICCs for the same item under two conditions (e.g., over time; K. Y. Chan et al., 1999). However, it seems that no prior research has used IRT methods explicitly for identifying beta change for polytomous Likert-type data.
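Such ICC comparisons can also be summarized numerically rather than only graphically. The sketch below is a rough illustration, not a procedure used in this article: it approximates the unsigned area between one item's Time 1 and Time 2 ICCs by numerical integration over a bounded range of θ, with invented item parameters.

```python
# Rough illustration: unsigned area between an item's Time 1 and Time 2 ICCs,
# integrated numerically over a bounded theta range. Invented parameters.
import numpy as np

def icc(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 2001)
p_time1 = icc(theta, a=1.2, b=0.0)   # hypothetical Time 1 parameters
p_time2 = icc(theta, a=1.2, b=0.6)   # hypothetical Time 2 parameters (the b has drifted)

area = np.trapz(np.abs(p_time1 - p_time2), theta)  # unsigned area between the curves
print(f"Unsigned area between ICCs: {area:.3f}")
```

A nonzero area indicates that respondents at the same θ do not have the same response probabilities at the two occasions, which is the intuition behind treating item parameter drift as a form of DIF.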

COMPARISONS OF IRT AND CFA METHODS OF DETECTING BETA CHANGE

CFA and IRT methods of establishing measurement invariance are equivalent in purpose. Both strive to ensure that a measure is psychometrically the same, both model the relation between observed and latent scores, and both can be used to estimate the extent of a lack of invariance when it is present (Raju et al., 2002; Reise, Widaman, & Pugh, 1993). Furthermore, both can utilize nested model comparisons and use chi-square tests to provide parametric tests of model fit.2

2Although approximations of chi-square difference tests do not exist for all IRT DIF detection methods, they are available for the methods used in this study.

However, there are some important differences between the CFA and IRT methodologies of establishing measurement invariance. First, whereas the CFA approach uses a linear model to describe the item to latent trait relation, IRT methods do not model a linear relation between item response and the underlying latent trait (Lord, 1953, 1980; Raju et al., 2002). Perhaps more important and underemphasized in published comparisons of the two methodologies, the CFA and IRT approaches differ in the amount and type of information that each can provide. Using the CFA approach, if the same simple structure fits the data across multiple time periods, gamma change is ruled out, and beta change is then investigated. Typically, IRT DIF analyses are unable to examine the relations between multiple latent variables, and thus, gamma change cannot be ruled out via unidimensional IRT analyses on their own. As such, unidimensional IRT analyses can only effectively deal with beta change. Although it may at some point be possible to rule out gamma change with multidimensional IRT, we are unaware of any demonstrated means for doing so.

Using CFA analyses, once gamma change is ruled out, beta change is then typically assessed by examining the equivalence of factor variances over time and also by examining the equivalence of factor loadings over time (see Vandenberg & Self, 1993, for details). In IRT analyses, typically the unidimensionality of a data set is established via exploratory factor analytic (EFA) methods before conducting IRT


analyses (see Reise et al., 2001, for an example). Drawing a parallel with IRT analyses, CFA tests of factor loadings would be conceptually equivalent to testing the invariance of the items’ a parameters over time (Maurer et al., 1998; McDonald, 1999). Figure 3 illustrates the case in which an item’s a parameter might differ across measurement occasions.

In addition to items’ a parameters differing over time, the IRT paradigm also can illustrate differences in items’ b parameters over time (in the IRT GRM). An example of an item with different b parameters over time can be found in Figure 4. Importantly, this difference in an item’s b parameters over time cannot be borne out via commonly used CFA methods. The most similar CFA-based parameters are item intercepts (D. Chan, 2000), which are only occasionally tested for a lack of invariance in practice (Vandenberg & Lance, 2000).3 More importantly, only one intercept parameter is modeled per item in CFA analyses, whereas with the IRT based GRM, one fewer b parameters than the number of response options is modeled for each item. These additional estimated parameters per item provide much more information regarding the psychometric properties of the item and thus a more stringent test of ME/I at the item level than do the corresponding CFA tests.

3Analysis of item intercepts and latent factor means in a CFA analysis is sometimes referred to as a mean and covariance structures analysis (MACS). Our reference to CFA tests in this article is to the broad family of tests that includes MACS as a subset.

Given that beta change can be defined as the lengthening or shortening of intervals between item response options over time, the IRT methods provide important information not obtainable with CFA methods. As seen in Figure 4, differences in items’ b parameters over time seem to graphically represent the very definition of beta change. This stands in contrast to differences in items’ a parameters over time (as in Figure 3), which do not conceptually fit the definition of beta change as well. Only through IRT are the differences in Figure 4 detectable, whereas the differences in Figure 3 can be detected via either IRT tests or CFA tests of factor loading equivalence. Note that these analyses are not specific to longitudinal data; they can and have also been used for cross-sectional data. However, the terminology traditionally associated with longitudinal measurement invariance, namely, gamma and beta change, is particularly useful for discussing the differences in these methodologies.
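Both the CFA comparisons described above and the IRT tests described later rest on nested model comparisons evaluated with chi-square (or G2) difference statistics. The sketch below is a generic illustration of that difference-test arithmetic, assuming the constrained and unconstrained fit statistics have already been obtained from whatever estimation program is used; the numbers shown are hypothetical.

```python
# Generic nested-model chi-square difference test (values are hypothetical).
from scipy.stats import chi2

def chi_square_difference(chisq_constrained, df_constrained, chisq_free, df_free, alpha=0.05):
    """Test whether constraining parameters (e.g., equal loadings over time) worsens model fit."""
    delta_chisq = chisq_constrained - chisq_free
    delta_df = df_constrained - df_free
    p_value = chi2.sf(delta_chisq, delta_df)
    return delta_chisq, delta_df, p_value, p_value < alpha

# Hypothetical example: constraining Time 1 and Time 2 factor loadings to equality
print(chi_square_difference(chisq_constrained=112.4, df_constrained=103,
                            chisq_free=98.7, df_free=96))
```

A significant difference indicates that the equality constraints are not tenable, that is, that the constrained parameters are not invariant across measurement occasions.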

JOB SATISFACTION

Job satisfaction has long been a topic of interest, as it is related to many individual and organizational outcomes. For individuals, job satisfaction can be considered important as an end in itself.

FIGURE 3 Operating characteristic curves (OCCs) of items exhibiting a parameter differential item functioning. Time 2 OCCs are the dashed line.

FIGURE 4 Operating characteristic curves of items exhibiting b parameter differential item functioning. Time 2 OCCs are the dashed line.


A large portion of an individual’s life is spent on the job, and as such, overall job satisfaction can influence overall life satisfaction (Judge & Locke, 1993; Lance, Lautenschlager, Sloan, & Varca, 1989; Locke, 1976; Tait, Padgett, & Baldwin, 1989). From an organizational perspective, job satisfaction is important, as it contributes to other work attitudes and to work outcomes. Although some studies have demonstrated that the relation between job satisfaction and job performance is minimal (e.g., Iaffaldano & Muchinsky, 1985), a recent reanalysis and considerable extension of that meta-analysis suggested that the population relation between overall job satisfaction and job performance is around .30 (Judge, Thoresen, Bono, & Patton, 2001). Job satisfaction is also directly associated with job withdrawal behaviors such as absenteeism, tardiness, task avoidance, and the use of work hours for personal tasks (Beehr & Gupta, 1978; Hulin, Roznowski, & Hachiya, 1985; Judge & Locke, 1993; Youngblood, 1984). In addition, numerous studies have demonstrated a strong relation between job satisfaction and both turnover intentions and turnover (Arnold & Feldman, 1982; Bluedorn, 1979; Farkas & Tetrick, 1989; Mobley, 1977; Tett & Meyer, 1993; Williams & Hazer, 1986).

One limitation in the job satisfaction literature has been the pervasive use of cross-sectional methodology. However, it would seem as though longitudinal assessment would be much more common in organizational settings than the literature would suggest, as frequent employee satisfaction (job satisfaction) surveys are common. In these assessments, the measurement of job satisfaction has been presumed to demonstrate measurement invariance over time. However, few, if any, studies have focused on this issue.

Measurement invariance in job satisfaction is very important from an organizational perspective. Human resource policies are designed to enhance the work environment for employees as a whole. If the construct of job satisfaction changes for workers over time, then meaningful changes in this important attitude are not readily measured. Furthermore, the best way to determine precursors and outcomes of job satisfaction is through longitudinal assessments of the construct. However, it is imperative that measurements across occasions be equivalent to infer meaningful differences in observed score differences in longitudinal assessments of job satisfaction.

The IRT methods we illustrate in the following provide a comprehensive method of establishing measurement invariance for any longitudinal measure. Although we are illustrating these techniques with a longitudinal measure of job satisfaction, they are no less important or useful for examining changes in test properties over time. If test users intend to track changes in the scores of test takers, refer to common norms, or ensure that the test maintains its predictive validity over time, ensuring the ME/I of the test is essential.

METHOD

Participants

Participants in this study were 182 men and 184 women who were freshmen at a large university in the southeastern United States in either 1968 or 1970 and who subsequently graduated from the university.


This study is part of a larger longitudinal study that began during the participants’ freshman year at the university and has continued throughout their work career (see Mumford, Stokes, & Owens, 1990, for a more detailed discussion). In this study, we examined job satisfaction at two points in time. The first questionnaire was administered in 1980 (Time 1). The second measure of job satisfaction was administered in 1995 (Time 2), 15 years after the first measure. All graduates who were freshmen in either 1968 or 1970 and who were on file in the alumni office were mailed copies of the survey. The response rate for the 1980 administration was approximately 60%, whereas the response rate of the 1995 administration was near 30%. The ethnic composition of the sample was almost exclusively White. Furthermore, we only used data for which participants responded to both the Time 1 and Time 2 measures in this study.

Measures

An eight-item job satisfaction measure was adopted from the Post-College Experiences Inventory (PCEI; see Mumford et al., 1990) and was used at both points in time. The items comprised a 5-point Likert-type scale with response alternatives ranging from 1 (strongly disagree) to 5 (strongly agree). In this sample (N = 366), the internal consistency for Time 1 was α = .76 and for Time 2 was α = .80. See Table 1 for item content, means, standard deviations, and response frequencies at Time 1 and Time 2. This scale was developed as a measure of overall job satisfaction; however, individual scale items appear to reflect a single-item, facet-based approach to assessing satisfaction.

Analyses

An important assumption of IRT methods is that the data being analyzed is unidimensional. Establishing unidimensionality seemed particularly important in this instance, as the PCEI items measure somewhat distinct aspects of job satisfaction. To ensure that the scale used in these analyses was unidimensional, EFA using principal axis factoring with squared multiple correlations for communality estimates was performed on the scale at both measurement occasions. The resulting three largest eigenvalues for the Time 1 analyses were 2.39, 0.58, and 0.08. For Time 2, the three largest eigenvalues were 2.69, 0.34, and 0.13. Furthermore, all eight items loaded onto the first factor with loadings of at least 0.4 for both time periods. Given the large discrepancy between the first and second eigenvalues at both points in time, the scale was assumed sufficiently unidimensional at both Time 1 and Time 2 (see Reise et al., 2001, for similar methods and interpretations). Equally important is the fact that although IRT analyses may detect differences in item functioning over time, it cannot be determined whether these differences are due to gamma change or beta change.


TABLE 1
Item Means, Standard Deviations, and Response Frequencies for Time 1 and Time 2

                                                                     Item Response Options
Item                                                    Time      1 or 2    3a      4      5      M      SD
1. The reputation and integrity of the organization
   and its management                                   Time 1       50      29    146    141    3.99   1.12
                                                        Time 2       50      37    162    117    3.91   1.07
2. Your relationship with your immediate supervisor     Time 1       40      42    133    151    4.05   1.06
                                                        Time 2       31      43    155    137    4.06   0.78
3. Your compensation and benefits                       Time 1       91      78    125     72    3.43   1.17
                                                        Time 2       69      73    173     51    3.54   1.00
4. Your long-term prospects and opportunity for
   advancement                                          Time 1       88      81    121     76    3.43   1.21
                                                        Time 2       86      93    143     44    3.34   1.07
5. The nature of the work itself (i.e., the job
   activities)                                          Time 1       28      38    167    133    4.08   0.94
                                                        Time 2       33      51    183     99    3.93   0.92
6. The opportunity for individual discretion and
   responsibility                                       Time 1        —      55    147    164    4.21   0.92
                                                        Time 2        —      49    159    158    4.25   0.81
7. Your working conditions (illumination,
   ventilation, cleanliness, etc.)                      Time 1       41      54    155    116    3.92   1.01
                                                        Time 2       36      44    141    145    4.06   1.00
8. Your relationships with your fellow workers          Time 1        —      52    164    150    4.23   0.79
                                                        Time 2        —      36    184    146    4.28   0.71

Note. Item stem = “How satisfied are you with … ”. Scale anchors are 1 (very dissatisfied), 2 (dissatisfied), 3 (neither satisfied nor dissatisfied), 4 (satisfied), and 5 (very satisfied).
a Item responses in the Category 3 column are for Categories 1, 2, or 3 for Items 6 and 8.

However, the scale proved unidimensional at both time periods, with all items loading onto a single factor at both occasions. Thus, gamma change can be ruled out because there was no difference in the number of factors or the factor structure over time.

We also examined frequencies of item response categories for each item. Respondents tended to primarily choose response Options 2 through 5 in their responses at both Time 1 and Time 2. For seven of the items, 25 or fewer respondents used the first response option (strongly disagree) at Time 1, with Item 8 having fewer than 25 respondents in Categories 1 and 2 combined. All of the items at Time 2 had fewer than 25 people make use of the first response option, with Items 6 and 8 having fewer than 25 respondents utilizing the first and second response options combined. Because small numbers of responses for extreme response categories


can lead to large standard errors of estimation for item parameters, response Options 1 and 2 were collapsed for all items, whereas Options 1, 2, and 3 were collapsed for Items 6 and 8 (cf. Hulin, Drasgow, & Parsons, 1983; Roznowski, 1989).4 A second EFA was conducted to ensure the data was unidimensional in its collapsed response category format. Results of the factor analysis were comparable to the results before collapsing categories for both measurement occasions, indicating that the data in the collapsed response format used in the data analyses remained unidimensional.

4Note that an equal number of response options for each item is not required for estimation in the GRM (Embretson & Reise, 2000).

Although there are many types of IRT DIF analyses for dichotomous items, methods are more limited for polytomous IRT models. In this study, the likelihood ratio (LR) test, originally developed for dichotomously scored data and adapted for polytomous data (Thissen, Steinberg, & Gerrard, 1986; Thissen, Steinberg, & Wainer, 1988, 1993), was used.

LR Test Method

The LR test involves nested model comparisons of item parameter estimates under different levels of constraint across the two measurement occasions (see Thissen et al., 1988, 1993). The IRTLRDIF (Thissen, 2001) program was used to perform the LR tests. This program utilizes several steps in computing DIF statistics. First, item parameters for all eight items are estimated, with the constraint that the item parameters for like items (e.g., Time 1, Item 1 and Time 2, Item 1) are equal for both time periods. Next, separate runs are performed in which all like items’ parameter estimates are constrained to be equal for both time periods except for the parameters of a single item, which are estimated separately for Time 1 and Time 2 data. Eight such runs are performed (one per item), each of which provides a G2 (asymptotically distributed as chi-square) value associated with each item. These G2 values indicate the improvement in model fit associated with freeing the item’s parameters. If the G2 statistic is significant, DIF is present for that item.

For items in which DIF is indicated, additional analyses are conducted by the IRTLRDIF program. As before, all item parameters are constrained to be equal across time periods except for the DIF item in question. For this item, two runs are computed, with only the item’s b parameters constrained across time periods in the first run. In the second run, the scenario is reversed: the item’s b parameters are freed, and the item’s a parameter is constrained. These runs provide tests of the equality of the item’s b and a parameters, respectively, to determine the precise source of DIF for the item. Note that the LR tests closely parallel the nested model chi-square tests conducted via CFA ME/I analyses in which parameters are constrained in additional data runs to determine the source of lack of ME/I in the data.
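The arithmetic behind each of these comparisons is a likelihood ratio. The sketch below illustrates the G2 computation in generic form, assuming the marginal log-likelihoods of the constrained and freed models are available from the estimation program; the values shown are hypothetical and are not output from IRTLRDIF.

```python
# Generic likelihood-ratio (G2) DIF test for one studied item (hypothetical values).
from scipy.stats import chi2

def lr_dif_test(loglik_constrained, loglik_free, n_freed_params, alpha=0.05):
    """G2 for freeing one item's parameters across Time 1 and Time 2."""
    g2 = 2.0 * (loglik_free - loglik_constrained)  # improvement in fit from freeing the item
    p_value = chi2.sf(g2, df=n_freed_params)
    return g2, p_value, p_value < alpha            # a significant G2 indicates DIF for that item

# Hypothetical values for one 4-category item (1 a parameter + 3 b parameters freed, df = 4)
print(lr_dif_test(loglik_constrained=-3471.2, loglik_free=-3465.7, n_freed_params=4))
```

The same logic, applied with only the a or only the b parameters freed, corresponds to the follow-up runs described above for locating the source of DIF.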


RESULTS AND DISCUSSION

The results of the LR tests are reported in Table 2. Results of the LR test indicated that Items 3, 4, and 7 were functioning differently at Time 1 and Time 2. Closer inspection reveals that the DIF present in those items was caused by differences in items’ b parameters in each case. The item parameters are reported in Table 3. The plots of the OCCs of the items exhibiting DIF are shown in Figures 5 through 7.

TABLE 2
Results of the Likelihood Ratio Test

Item    Parameters Tested      G2      df
1       All                     2.3     4
2       All                     1.5     4
3       All                    11.1*    4
        a                       0.0     1
        b                      11.1*    3
4       All                     9.6*    4
        a                       0.9     1
        b                       8.7*    3
5       All                     7.9     4
6       All                     1.9     3
7       All                    13.8*    4
        a                       0.4     1
        b                      13.3*    3
8       All                     3.1     3

Note. G2 = values asymptotically distributed as chi-squares. The df for Items 6 and 8 reflect fewer response options than other items.
*p < .05.

TABLE 3
Item Parameters for the Job Satisfaction Scale

                      Time 1                              Time 2
Item      a       b1       b2       b3        a       b1       b2       b3
1       1.09    –0.49     0.14     0.50     0.97    –0.72     0.06     0.57
2       0.97    –0.40     0.15     0.73     1.20    –0.33     0.08     0.60
3       0.48    –2.96    –0.34     1.52     0.51    –3.59    –1.23     0.51
4       0.73    –1.94    –0.19     1.16     0.57    –3.50    –0.79     1.16
5       1.16    –0.56    –0.20     0.24     0.88    –1.12    –0.52     0.26
6       1.33      —      –0.19     0.41     1.55      —      –0.02     0.52
7       0.83    –1.04    –0.36     0.45     0.96    –0.25     0.29     0.91
8       1.28      —      –0.34     0.25     1.64      —      –0.15     0.23

Note. a = a parameters; b = b parameters. Items 6 and 8 have no b1 because their lowest response categories were collapsed.
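As a rough illustration of what the b parameter drift in Table 3 implies, the sketch below applies the logistic graded response model to the Item 3 estimates at each time point and compares the probability of endorsing the highest response category for a respondent at θ = 0. It assumes the logistic form without a scaling constant, which may not match the metric of the original estimation program exactly.

```python
# Rough illustration of the Item 3 b-parameter drift using the Table 3 estimates
# (Time 1: a = 0.48, b = -2.96, -0.34, 1.52; Time 2: a = 0.51, b = -3.59, -1.23, 0.51).
# Assumes the logistic GRM without a scaling constant.
import numpy as np

def boundary_probs(theta, a, bs):
    """P(response at or above each successive category) -- the OCCs plotted in Figure 5."""
    return [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in bs]

theta = 0.0  # a respondent of average job satisfaction
time1 = boundary_probs(theta, a=0.48, bs=[-2.96, -0.34, 1.52])
time2 = boundary_probs(theta, a=0.51, bs=[-3.59, -1.23, 0.51])

# The probability of endorsing the highest ("very satisfied") category rises at Time 2
# even though theta is held constant -- the recalibration interpreted as beta change.
print("P(top category | theta = 0), Time 1:", round(time1[-1], 3))
print("P(top category | theta = 0), Time 2:", round(time2[-1], 3))
```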

FIGURE 5 Operating characteristic curves for Item 3. Time 1 in solid line, Time 2 in dashed line.

FIGURE 6 Operating characteristic curves for Item 4. Time 1 in solid line, Time 2 in dashed line.


FIGURE 7 Operating characteristic curves for Item 7. Time 1 in solid line, Time 2 in dashed line.

The primary purpose of this study was to illustrate the use of IRT methods to assess beta change in an eight-item longitudinal measure of job satisfaction. Gamma change was ruled out by conducting an EFA on the scale at both time periods, and IRT LR tests were used to assess beta change over time. Three items were shown to have b parameter DIF, whereas no items showed significantly different a parameters across time periods. Examination of the plots in Figures 5 through 7 provides a graphical summary of these results.

Given that IRT a parameters are analogous to CFA factor loadings and did not vary across time periods, it is unlikely that any of these items would have been determined to lack ME/I across time periods had the analyses been conducted using CFA. Although it is possible that CFA tests of equal item intercepts may have detected some differences in the data, these tests are typically not performed as tests of longitudinal ME/I. Moreover, as there are many more IRT b parameters than CFA item intercepts that must be equal over time periods, it is likely that the CFA tests would be considerably more liberal than the IRT tests in identifying this form of beta change. This would seem to be particularly true if CFA analyses of ME/I were conducted at the scale rather than item level as is typical.

Although notoriously difficult to determine, we speculate on some of the possible reasons that DIF was found in this study. First, the time span between the two measurement periods was 15 years. It is likely that many respondents had changed jobs, organizations, or even occupations during this time span. However, the structure


of this data made it impossible to identify whether the participants had been promoted or had changed jobs, companies, or occupations during this time. Respondents in 1980 had only been out of college for 6 to 8 years and thus were still early in their careers. In the 15 years between 1980 and 1995, it is likely that the change in economic conditions (e.g., inflation) and peer reference groups could account for this recalibration. At Time 2, the respondents were in the middle career stage of their work life. The issues facing middle career individuals are markedly different from those in the early stage of their career. At this point in time, an individual may have been promoted, have experienced other forms of career success, and have received incremental pay increases over the 15 years since the first measurement (Feldman, 1988). As such, having a different social comparison group (e.g., comparing themselves to other lower level employees in 1980 and other managers in 1995) could also account for this recalibration in the response scale across the two measurement occasions. Given this, it is hardly surprising to find that what respondents viewed as appropriate compensation and benefits (Item 3) had changed over time. At mid-career, respondents appeared to have less difficulty in choosing the very satisfied option for Item 3, as would be expected for persons who have enjoyed some upward career mobility. Although appealing and plausible, this hypothesis needs further testing.

Because this study established that beta change did occur between the Time 1 and Time 2 measurement periods, observed scores should not be used to make comparisons for these measurement periods. Instead, participants’ latent job satisfaction levels should be established through IRT estimation of the Θ parameter. Because IRT measures of Θ are invariant across measures of the same construct with different psychometric properties (Lord, 1980), these Θ estimates can be safely compared even though the Time 1 and Time 2 measures are not psychometrically equivalent.5

5This is presuming the scales are otherwise psychometrically sound.

Ultimately, however, as in any DIF study, the researcher will have to make judgments about what to do with items flagged as potentially problematic in his or her context. If sufficient items exist, the deletion of all DIF-flagged items may not pose problems for reliable assessment provided that the construct domain remains adequately covered. Otherwise, the issue of inclusion may be examined by determining whether there is a difference in outcomes. Rationales for exclusion or inclusion of items need to be developed and examined carefully. Although this is important for ability tests and other lengthy tests with dichotomous item scoring, this is particularly important for Likert scales in which there are few items, and each item has been carefully crafted to fully capture the construct domain. With our particular measure of overall job satisfaction computed via aggregating responses to facet-based items, removing any item from the scale would change the interpretation


of the composite total score and thus was not an option. Instead, job satisfaction comparisons needed to be estimated at the latent level for our data even though this was considerably more cumbersome than examination of observed scores.

Although the job satisfaction measure used in this study was developed specifically for this longitudinal assessment program, this measure is not unlike those used in other studies. This measure was carefully developed (see Mumford et al., 1990) and can be considered psychometrically sound, yet this measure exhibited beta change over time. Because this is the first study known to investigate the possibility of beta change with IRT methods for polytomous items, it is unknown how many other longitudinal measures may also exhibit beta change. A recent review by Riordan, Richardson, Schaffer, and Vandenberg (2001) indicated that 94% of longitudinal studies in organizational research did not test for gamma or beta change. Also, because IRT methods provide insight into beta change not typically available with CFA methodology, it is also possible that beta change may have occurred even on occasions in which CFA methods found ME/I over time.

Limitations

One issue of note in this study was the relatively low item a parameters associated with all three DIF items detected in this study. Although the scale was shown to be clearly unidimensional by our factor analyses, the a parameters for these items were somewhat low at both time periods. It is unclear why this would be the case given the results of our factor analyses. Although differences between items’ a parameters can certainly be expected for different scale items, we did not anticipate finding parameters this low in magnitude. For our data, some scale items were much more sensitive to changes in the latent construct than were others.

Perhaps the most serious concern for the analyses in our study is the low sample size. Conventional wisdom in IRT research holds that sample sizes should be well above those present in this study. However, as an anonymous reviewer pointed out, the sample sizes encountered in this study are not uncommon in longitudinal research. Although item parameter estimation can certainly be done with smaller samples, typically large standard errors around those parameter estimates are encountered. With these larger standard errors come both reduced power of the LR test to detect DIF items and increased chances of Type I and Type II errors.

One limitation of the IRTLRDIF program used in this study is that it does not report standard errors associated with estimated item parameters. However, we do not believe that this was a problem in this study, as reduced power did not seem to be an issue. We found significant DIF for three items, indicating that standard errors were small enough to determine that item parameters significantly differed across time periods. Furthermore, follow-up analyses using MULTILOG (Thissen, 1991) indicated that the standard errors of the item parameters were typically below .2. Also, interestingly, MULTILOG analyses of the same data for the


two time periods separately indicated considerably higher a parameters at both time periods (ranging from 0.82 to 2.63). Last, there are some indications that sample sizes of 150 may be adequate to detect some types of DIF with the IRT LR test (Meade & Lautenschlager, 2004), although much more work is needed in this area. Other longitudinal researchers should carefully inspect the standard errors associated with the parameter estimates of their data, however, to determine whether their sample size and other data properties are sufficient to utilize the IRT methods we described in this article.

Conclusions

IRT methods may be powerful indicators of beta change over time. In future studies, it is recommended that researchers (a) investigate the dimensionality of the scale for single scales or use confirmatory methods as recommended by Schmitt (1982) to investigate the possibility of gamma change over time for multidimensional scales and (b) use IRT methods to investigate differential functioning of items over time for each factor subscale. A recent example of this type of methodology for cross-gender comparisons can be found in Reise et al. (2001). If confirmatory methods do not indicate that gamma change has occurred, yet IRT methods identify DIF over time, then beta change has occurred.

This study also identifies several needed areas of future research. First, more researchers need to test for the possibility of beta change over time so that the prevalence of such changes is known. Much is still unknown about the occurrence of gamma and beta change, as few researchers conduct these analyses (Riordan et al., 2001). If beta change proves to be very prevalent in longitudinal studies, some doubt may be cast on studies that do not establish measurement invariance before comparing differences in observed scores over time. Also, we second Raju et al.’s (2002) call for simulation studies that compare the results of CFA and IRT approaches for establishing measurement invariance. Recent simulation work has begun to directly compare these methodologies (e.g., Meade & Lautenschlager, 2004; Zumbo, 2003; Zumbo & Koh, 2005). Although the distinctions in these two methodologies suggest the possibility of different results for the methodologies, it remains to be seen whether or not these differences will materialize in practice. Furthermore, with recent advances in ordinal data models of CFA techniques, it may be possible to achieve similar results with CFA methods. However, as of yet, these models are either not far enough advanced or are not readily available to the general research community.

Conversely, one advantage that current CFA models have over the IRT models used in this study is that covariances between like items across different time periods can be modeled. This is commonly done using a stacked covariance matrix in CFA analyses (see Vandenberg & Lance, 2000, for a review), yet is not possible using the GRM in IRT. Recently, however, random effect IRT models have been developed


to account for dependencies between items and testlets (Bradlow, Wainer, & Wang, 1999; Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003). One promising avenue of future research would be the expansion and application of these models to longitudinal data. For the time being, it appears that current IRT methods may provide more and different item information than typical CFA methods alone, but the relation between these two methods remains both ambiguous and in need of further exploration.

ACKNOWLEDGMENT

We thank Garnett Stokes and others at the University of Georgia involved with the Longitudinal Biodata Project for providing the data used in this study.

REFERENCES

Alvares, K. M., & Hulin, C. L. (1972). Two explanations of temporal changes in ability-skill relationships: A literature review and theoretical analysis. Human Factors, 14, 295–308.
Arnold, H., & Feldman, D. C. (1982). A multivariate analysis of the determinants of job turnover. Journal of Applied Psychology, 67, 350–360.
Beehr, T. A., & Gupta, N. (1978). A note on the structure of employee withdrawal. Organizational Behavior and Human Performance, 21, 73–79.
Bluedorn, A. C. (1979). Structure, environment, and satisfaction: Toward a causal model of turnover from military organizations. Journal of Military and Political Sociology, 7, 181–207.
Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25, 275–285.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.
Chan, D. (2000). Detection of differential item functioning on the Kirton Adaption-Innovation Inventory using multiple-group mean and covariance structure analyses. Multivariate Behavioral Research, 35, 169–199.
Chan, K. Y., Drasgow, F., & Sawin, L. L. (1999). What is the shelf life of a test? The effect of time on the psychometrics of a cognitive ability test battery. Journal of Applied Psychology, 84, 610–619.
Cronbach, L. J. (1992). Four Psychological Bulletin articles in perspective. Psychological Bulletin, 112, 389–392.
Cronbach, L. J., & Furby, L. (1970). How should we measure “change”: Or should we? Psychological Bulletin, 74, 68–80.
Edwards, J. R. (1994). The study of congruence in organizational behavior research: Critique and a proposed alternative. Organizational Behavior and Human Decision Processes, 58, 141–155.
Edwards, J. R. (2001). Ten difference score myths. Organizational Research Methods, 4, 265–287.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Farkas, A. J., & Tetrick, L. E. (1989). A three-wave longitudinal analysis of the causal ordering of satisfaction and commitment on turnover decisions. Journal of Applied Psychology, 74, 855–868.
Feldman, D. C. (1988). Managing careers in organizations. Glenview, IL: Scott, Foresman.
Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20, 369–377.
Golembiewski, R. T., Billingsley, K., & Yeager, S. (1976). Measuring change and persistence in human affairs: Types of change generated by OD designs. Journal of Applied Behavioral Science, 12, 133–157.
Henry, R. A., & Hulin, C. L. (1987). Stability of skilled performance across time: Some generalizations and limitations on utilities. Journal of Applied Psychology, 72, 457–462.
Hulin, C. L., Drasgow, F., & Parsons, C. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow-Jones-Irwin.
Hulin, C. L., Roznowski, M., & Hachiya, D. (1985). Alternative opportunities and withdrawal decisions: Empirical and theoretical discrepancies and an integration. Psychological Bulletin, 97, 233–250.
Iaffaldano, M. T., & Muchinsky, P. M. (1985). Job satisfaction and job performance: A meta-analysis. Psychological Bulletin, 97, 251–273.
Judge, T. A., & Locke, E. A. (1993). Effect of dysfunctional thought processes on subjective well-being and job satisfaction. Journal of Applied Psychology, 78, 475–490.
Judge, T. A., Thoresen, C. J., Bono, J. E., & Patton, G. K. (2001). The job satisfaction-job performance relationship: A qualitative and quantitative review. Psychological Bulletin, 127, 376–407.
Lance, C. E., Lautenschlager, G. J., Sloan, C. E., & Varca, P. E. (1989). A comparison between bottom-up, top-down, and bidirectional models of relationships between global and life facet satisfaction. Journal of Personality, 57, 601–624.
Locke, E. A. (1976). The nature and causes of job satisfaction. In M. D. Dunnette (Ed.), Handbook of industrial and organizational psychology (pp. 1297–1349). Chicago: Rand McNally.
Lord, F. M. (1953). The relationship of test score to trait underlying the test. Educational and Psychological Measurement, 13, 517–548.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate performance appraisal measurement invariance. Journal of Applied Psychology, 83, 693–702.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Meade, A. W., & Lautenschlager, G. J. (2004). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods, 7, 361–388.
Mobley, W. H. (1977). Intermediate linkages in the relationship between job satisfaction and employee turnover. Journal of Applied Psychology, 63, 408–414.
Mumford, M. D., Stokes, G. S., & Owens, W. A. (1990). Patterns of life history: The ecology of human individuality. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Muraki, E. (1990). Fitting a polytomous item response model to Likert-type data. Applied Psychological Measurement, 14, 59–71.
Parducci, A. (1968). The relativism of absolute judgments. Scientific American, 219, 84–90.
Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87, 517–529.
Reise, S. P., Smith, L., & Furr, M. (2001). Invariance on the NEO–PI Neuroticism Scale. Multivariate Behavioral Research, 36, 83–110.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552–566.

Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8, 185–205.
Riordan, C. M., Richardson, H. A., Schaffer, B. S., & Vandenberg, R. J. (2001). Alpha, beta, and gamma change: A review of past research with recommendations for new directions. In C. A. Schriesheim & L. L. Neider (Eds.), Equivalence of measurement (pp. 51–97). Greenwich, CT: Information Age Publishing.
Roznowski, M. (1989). An examination of the measurement properties of the Job Descriptive Index with experimental items. Journal of Applied Psychology, 74, 805–814.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Iowa City: Psychometric Society.
Schaubroeck, J., & Green, S. G. (1989). Confirmatory factor analytic procedures for assessing change during organizational entry. Journal of Applied Psychology, 74, 892–900.
Schmitt, N. (1982). The use of analysis of covariance structures to assess beta and gamma change. Multivariate Behavioral Research, 17, 343–358.
Tait, M., Padgett, M. Y., & Baldwin, T. T. (1989). Job and life satisfaction: A reexamination of the strength of the relationship and gender effects as a function of the date of the study. Journal of Applied Psychology, 74, 504–507.
Taris, T. W., Bok, I. A., & Meijer, Z. Y. (1998). Assessing stability and change of psychometric properties of multi-item concepts across different situations: A general approach. Journal of Psychology, 132, 301–316.
Tett, R. P., & Meyer, J. P. (1993). Job satisfaction, organizational commitment, turnover intention, and turnover: Path analyses based on meta-analytic findings. Personnel Psychology, 46, 259–293.
Thissen, D. (1991). MULTILOG users guide: Multiple categorical item analysis and test scoring using item response theory. Chicago, IL: Scientific Software International.
Thissen, D. (2001). IRTLRDIF v.2.0b. Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. Chapel Hill: University of North Carolina.
Thissen, D., Steinberg, L., & Gerrard, M. (1986). Beyond group mean differences: The concept of item bias. Psychological Bulletin, 99, 118–128.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147–169). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Upshaw, H. S. (1969). The personal reference scale: An approach to social judgment. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 4, pp. 315–371). New York: Academic.
Vandenberg, R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5, 139–158.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4–69.
Vandenberg, R. J., & Self, R. M. (1993). Assessing newcomer’s changing commitments to the organization during the first 6 months of work. Journal of Applied Psychology, 78, 557–568.
Williams, L. J., & Hazer, J. T. (1986). Antecedents and consequences of satisfaction and commitment in turnover models: A reanalysis using latent variable structural equation methods. Journal of Applied Psychology, 71, 219–231.
Youngblood, S. A. (1984). Work, nonwork, and withdrawal. Journal of Applied Psychology, 69, 106–117.

Zumbo, B. D. (1999). The simple difference score as an inherently poor measure of change: Some reality, much mythology. In B. Thompson (Ed.), Advances in social science methodology (Vol. 5, pp. 269–304). Greenwich, CT: JAI.
Zumbo, B. D. (2003). Does item-level DIF manifest itself in scale-level analyses? Implications for translating language tests. Language Testing, 20, 136–147.
Zumbo, B. D., & Koh, K. H. (2005). Manifestation of differences in item-level characteristics in scale-level measurement invariance tests of multi-group confirmatory factor analyses. Journal of Modern Applied Statistical Methods, 4, 275–282.
