Meade, A. W. & Lautenschlager, G. J. (2004, April). Same Question, Different Answers: CFA and Two IRT Approaches to Measurement Invariance. Symposium presented at the 19th Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, IL.
Same Question, Different Answers: CFA and Two IRT Approaches to Measurement Invariance

Adam W. Meade
North Carolina State University

Gary J. Lautenschlager
University of Georgia

The effectiveness of confirmatory factor analytic (CFA) and item response theory (IRT) methods of assessing measurement invariance was investigated using simulated data with a known lack of invariance. Across all study conditions, IRT likelihood ratio (LR) tests consistently outperformed both CFA and IRT differential functioning of items and tests (DFIT) analyses in detecting a lack of invariance known to exist. Implications of the divergent results produced by different measurement invariance tests are discussed and recommendations for best practice are provided.
Author's note: Some analyses using these same data and portions of this manuscript can be found in: Meade, A. W., & Lautenschlager, G. J. (in press). A comparison of item response theory and confirmatory factor analytic methodologies for establishing measurement equivalence/invariance. Organizational Research Methods. Portions of this study were conducted as part of doctoral dissertation research at the University of Georgia.

The history of research on measurement equivalence/invariance (ME/I) is long, varied, and multi-disciplinary (Riordan, Richardson, Schaffer, & Vandenberg, 2001; Vandenberg & Lance, 2000). Recently, interest in ME/I issues has been on the rise (Vandenberg, 2002), particularly with regard to the methodology used to make decisions regarding the appropriateness of comparisons across populations, time, and response media (e.g., web and paper-and-pencil). While the most popular tests of ME/I in I/O psychology are based on confirmatory factor analytic (CFA) procedures, several authors (Meade & Lautenschlager, 2003; Raju, Laffitte, & Byrne, 2002; Reise, Widaman, & Pugh, 1993) have proposed using well-established item response theory (IRT) methods designed to test for ME/I (more commonly referred to as differential item functioning, DIF, in the IRT literature). Though some direct comparisons of CFA and IRT methods of establishing ME/I have been published (Facteau & Craig, 2001; Maurer, Raju, & Collins, 1998; Raju et al., 2002; Reise et al., 1993), these studies do not fully explicate the similarities and differences in IRT and CFA methods of establishing ME/I. Although these studies rightfully point out obvious differences in the two methodologies (e.g., linear versus non-linear models; unidimensional versus multi-dimensional comparisons), they do not adequately highlight subtler differences between the two methods. Moreover, multiple types of IRT tests of ME/I exist, and they can provide substantially different results depending on which is used. These tests are commonly grouped together as though all IRT tests were the same when in fact they are substantially different.

CFA Tests of ME/I

CFA analyses can be described mathematically by the formula:

xi = τi + λiξ + δi    (1)

where xi is the observed response to item i, τi is the intercept for item i, λi is the factor loading for item i, ξ is the latent construct, and δi is the residual/error term for item i. Under this methodology, it is clear that the observed response is a linear combination of a latent variable, an item intercept, a factor loading, and a residual/error score for the item.
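To make Equation 1 concrete, the short sketch below (ours, not part of the original study; all parameter values are arbitrary) generates simulated responses to a single item under this linear measurement model.

```python
import numpy as np

# Arbitrary illustrative parameters for one item (not values used in the study).
tau_i = 3.0        # item intercept
lambda_i = 0.8     # factor loading
sd_delta = 0.5     # standard deviation of the residual/error term

rng = np.random.default_rng(0)
n_respondents = 1000
xi = rng.normal(0.0, 1.0, size=n_respondents)          # latent construct scores
delta = rng.normal(0.0, sd_delta, size=n_respondents)  # item residuals
x_i = tau_i + lambda_i * xi + delta                     # Equation 1: observed responses

print(round(x_i.mean(), 2), round(x_i.std(), 2))
```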
Typically, several models are examined in order to detect a lack of ME/I. Researchers commonly first perform tests of covariance matrix equality (Vandenberg & Lance, 2000). If the omnibus test indicates that there are no differences in covariance matrices between the data sets, then the researcher may conclude that ME/I conditions hold for the data. In order to perform nested model tests, both data sets (representing groups, time periods, etc.) are examined simultaneously, holding only the pattern of factor loadings invariant. In other words, the same items are forced to load onto the same factors, but the parameter estimates themselves are allowed to vary between groups. This baseline model of equal factor patterns provides a chi-square value that reflects model fit when item parameters are estimated separately for each situation, and it represents a test of configural invariance (Horn & McArdle, 1992). Next, a test of factor loading invariance across situations is conducted by examining a model identical to the baseline model except that factor loadings are constrained to be equal across situations. The difference in fit between the baseline and the more restricted model is expressed as a chi-square statistic with degrees of freedom equal to the number of constrained parameters. Although subsequent tests of individual item factor loadings can also be conducted, this is seldom done in practice and is almost never done unless the overall test of differing Λx matrices is significant (Vandenberg & Lance, 2000). Subsequent to tests of factor loadings, Vandenberg and Lance (2000) recommend tests of item intercepts, which are constrained in addition to the factor loadings constrained in the prior step. Next, the researcher is left to choose additional tests to best suit his or her needs. Possible tests include tests of latent means, tests of equal item uniqueness terms, and tests of factor variances and covariances, each of which is constrained in addition to the parameters constrained in previous models, as meets the researcher's needs.¹

IRT Framework

The IRT framework posits a non-linear (logistic), rather than a linear, model to describe the relationship between observed item responses and the level of the underlying latent trait, θ. The exact nature of this model is determined by a set of item parameters that are potentially unique for each item. The estimate of a person's level of the latent trait (θ) is based on observed item responses given the item parameters. Two types of item parameters are frequently estimated for each item. The discrimination or a parameter represents the slope of the item trace line (called an item characteristic curve, ICC, or item response function, IRF) that determines the relationship between the latent trait and the probability of getting a dichotomous test item correct (see Figure 1). The second type of item parameter is the item location or b parameter, which determines the horizontal positioning of the inflection point of the ICC. For dichotomous IRT models, the b parameter is typically referred to as the item difficulty parameter (see Lord, 1980, or Embretson & Reise, 2000, for general introductions to IRT methods). The most commonly used IRT model for polytomous items, such as the Likert scales common in organizational research, is the graded response model (GRM; Samejima, 1969). In the GRM, the relationship between a participant's level of a latent trait (θ) and that participant's likelihood of choosing a particular response category is typically depicted by a series of boundary response functions (BRFs). An example of BRFs for an item with 5 response categories is given in Figure 2. The BRFs for each item are calculated using the function

P*ik(θs) = exp[ai(θs − bik)] / (1 + exp[ai(θs − bik)])    (2)

where P*ik(θs) represents the probability that an examinee (s) with an ability level θs will respond to item i at or above category k. There are one fewer b parameters (one for each BRF) than there are item response categories. As before, the a parameters represent item discrimination parameters. The BRFs are similar to the item characteristic curve (ICC) shown in Figure 1, except that mi − 1 BRFs are needed for a given item with mi response categories. Only one a parameter is needed for each item because this parameter is constrained to be equal across BRFs within a given item, although ai parameters may vary across items under the GRM.
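The following sketch (our illustration; the function names are ours) evaluates Equation 2 and the resulting category probabilities for a five-option item. The item parameters roughly mirror the Group 1 generating values described later in the Method section.

```python
import numpy as np

def boundary_probs(theta, a, b):
    """Equation 2: P*(responding at or above category k) for each of the m - 1 boundaries."""
    return 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))

def category_probs(theta, a, b):
    """Probability of each of the m categories, as differences of adjacent BRFs
    (with implicit boundaries of 1 below the lowest category and 0 above the highest)."""
    p_star = boundary_probs(theta, a, b)
    return np.concatenate(([1.0], p_star)) - np.concatenate((p_star, [0.0]))

# Illustrative parameters for one 5-option item: a = 1.25 and four b parameters 1.2 units apart.
a_i, b_i = 1.25, [-1.7, -0.5, 0.7, 1.9]
print(category_probs(0.0, a_i, b_i).round(3))   # category probabilities at theta = 0
```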
IRT LR Tests

IRT tests of ME/I vary depending on the specific methodology used. Likelihood ratio (LR) tests occur at the item level and, as with CFA methods, maximum likelihood estimation is used to estimate item parameters, resulting in a value of model fit known as the fit function. In IRT, this fit function value is an index of how well the given model (in this case a logistic model) fits the data as a result of the maximum likelihood estimation procedure used to estimate item parameters (Camilli & Shepard, 1994).
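Both the CFA nested-model comparisons described earlier and the IRT LR test described in the next section come down to the same operation: the fit of a constrained model is compared with that of a less constrained model, and the difference is referred to a chi-square distribution. A minimal, generic sketch of that comparison follows; the function and all numbers are ours, not output from LISREL or MULTILOG.

```python
from scipy.stats import chi2

def nested_model_test(fit_constrained, fit_free, n_constraints, alpha=0.05):
    """Chi-square difference test for nested models.

    fit_constrained / fit_free: chi-square or -2 log-likelihood values for the more
    constrained model and the less constrained (baseline or augmented) model.
    n_constraints: number of parameters constrained, i.e., the df of the test."""
    delta = fit_constrained - fit_free
    p = chi2.sf(delta, n_constraints)
    return delta, p, p < alpha

# Hypothetical fit values purely for illustration:
print(nested_model_test(fit_constrained=118.4, fit_free=109.9, n_constraints=6))
```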
¹ Note that recently some authors have used the terminology of mean and covariance structure analysis (MACS) to refer to CFA models in which latent means and item intercepts are estimated (e.g., Chan, 2000; Ployhart & Oswald, 2004). Our references to CFA models throughout the manuscript are intended to include such models.
The LR test involves comparing the fit of two models: a baseline (compact) model and a comparison (augmented) model. First, the baseline model is assessed, in which all item parameters for all test items are estimated with the constraint that the parameters for like items (e.g., Group 1 Item 1 and Group 2 Item 1) are equal across situations (Thissen, Steinberg, & Wainer, 1988, 1993). This compact model provides a baseline likelihood value for item parameter fit. Next, each item is tested, one at a time, for differential item functioning (DIF). In order to test each item for DIF, separate runs are performed for each item in which all like items' parameter estimates are constrained to be equal across situations (e.g., time periods, groups of persons), with the exception of the parameters of the item being tested for DIF. This augmented model provides a likelihood value associated with estimating the parameters of item i separately for each group. This likelihood value can then be compared to the likelihood value of the compact model via a chi-square test.

IRT DFIT Tests

The differential functioning of items and tests (DFIT) framework (Raju, van der Linden, & Fleer, 1995) has been used extensively in past studies published in prominent journals (e.g., Journal of Applied Psychology; see Facteau & Craig, 2001; Maurer et al., 1998; Raju et al., 2002). Excellent reviews of the specifics of the methodology can be found in those articles and elsewhere; only a brief overview is provided here. The DFIT program produces both item-level statistics of DIF and test-level statistics of differential test functioning (DTF). The primary index of DIF is the non-compensatory DIF index (NCDIF), which considers each item independently of all other scale items. The NCDIF index compares differences in response probabilities (called expected scores) for respondents in each sample. More specifically, expected scores are first computed for persons in the smaller group (i.e., the focal group, usually a minority group in traditional studies of differential functioning) using the item parameters estimated for that group. Second, expected scores are computed for persons in the focal group, but using the item parameters estimated from the larger sub-group (i.e., the referent group). In other words, expected scores are computed for focal group members that reflect their probabilities of response if they had been referent group members. If these two sets of expected scores differ by a non-trivial amount (judged against a cutoff value rather than with a parametric test), DIF (and possibly DTF) is said to exist. This index is considered non-compensatory because any difference in the relationship between the latent trait and the probability of response is considered a flag for possible DIF.
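A rough sketch of the NCDIF idea follows. This is our simplified illustration, not the DFITP6.0 implementation: for each focal-group theta estimate, the item's expected score is computed under focal-group and referent-group parameters, and the squared differences are averaged.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """GRM category probabilities obtained from the boundary response functions (Equation 2)."""
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    return np.concatenate(([1.0], p_star)) - np.concatenate((p_star, [0.0]))

def expected_item_score(theta, a, b, scores=(1, 2, 3, 4, 5)):
    """Model-implied expected item score at a given theta."""
    return float(np.dot(scores, grm_category_probs(theta, a, b)))

def ncdif_like(thetas_focal, focal_params, referent_params):
    """Mean squared difference in expected item scores for focal-group members, computed
    once with focal-group and once with referent-group item parameters. An illustration
    of the NCDIF logic only, not the official DFIT computation."""
    d = [expected_item_score(t, *focal_params) - expected_item_score(t, *referent_params)
         for t in thetas_focal]
    return float(np.mean(np.square(d)))

# Hypothetical parameters mirroring the "highest b" manipulation (largest b shifted by .40):
referent = (1.25, [-1.7, -0.5, 0.7, 1.9])
focal = (1.25, [-1.7, -0.5, 0.7, 2.3])
thetas = np.random.default_rng(1).normal(0.0, 1.0, 500)
print(ncdif_like(thetas, focal, referent))  # compare against the .096 cutoff noted in the Method
```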
The DTF statistic is based on the sum of item-level compensatory DIF (CDIF) statistics. The CDIF statistic is conceptually similar to a signed area difference measure in that differences between groups in the relationship between the latent trait and the probability of item response retain their sign. Thus, an item that "favors" Group 1 may have a positive sign while an item that "favors" Group 2 may have a negative sign. When these CDIF indices are summed across scale items, they produce the DTF index; because the summed CDIF values are signed, DIF that favors different groups can cancel at the scale level.

Comparison of CFA and IRT Methods

Item a parameters are conceptually analogous, and mathematically related, to factor loadings in the CFA methodology (McDonald, 1999). Item b parameters have no clear equivalent in CFA, though CFA item intercepts, which define the expected observed score when the latent construct equals zero, are the most conceptually similar. However, whereas only one item intercept can be estimated for each item in CFA, the IRT-based GRM estimates several b parameters per item. Though CFA item intercepts are ostensibly most similar to these GRM b parameters, they are still considerably different in function and interpretation, as only one intercept is estimated per item (see Figure 3 for a graphical representation of CFA item intercepts). When establishing measurement invariance within a given scale, commonly used IRT models (e.g., the GRM) estimate more parameters per item than do CFA methods. For each item, CFA ME/I methods typically estimate a factor loading, an item intercept, and an item uniqueness term, whereas IRT methods estimate one a parameter and four b parameters (for a five-option item). The additional information provided by the IRT b parameters allows a more stringent test of ME/I than do their CFA counterparts. Given the similarity between CFA factor loadings and IRT a parameters, it is likely that both CFA and IRT tests should be able to detect differences in items' a parameters (see Figure 4), but CFA tests would be unlikely to detect differences in items' b parameters (see Figure 5). However, the IRT LR tests and the DFIT framework utilize the information gained via these additional item parameters in different ways. The IRT LR test compares each item's parameters across data conditions in order to establish ME/I. This test is the most stringent, as DIF may be indicated if even one item parameter differs for a single item. This method assumes a traditional definition of DIF, that is, if there exists some level of the latent trait for which persons with the same level of the latent trait would
have different probabilities of responding to an item based on some defined sub-group membership, then DIF is said to exist. By contrast, the DFIT framework takes a more pragmatic approach to DIF assessment. DIF and DTF are shown to exist only for persons in the sample, and expected scores are computed only for those response options chosen by respondents in the sample. Thus, if item parameters differ at high or low regions of the latent continuum where few responses to the extreme options are observed, these differences have minimal impact on the overall computations, as such responses are typically represented in small numbers in a sample. Specifically, only the item responses and theta estimates of focal group members are considered when assessing DIF and DTF. As a result of this property of the DFIT framework and the use of cut scores instead of strict parametric tests, the DFIT framework is less stringent than LR tests in detecting DIF. In any case, however, the DFIT framework does directly utilize the information provided by b parameters in computing latent trait estimates and expected scores.

In this study, we simulated data with a known lack of ME/I and hypothesized the following:

Hypothesis 1: Items with simulated differences in b parameters will be detected as lacking ME/I by both IRT methods, but will not be detected as lacking ME/I by CFA methods.

Hypothesis 2: Items with simulated differences in a parameters will be detected as lacking ME/I by CFA and both IRT methods.

Method

Data Properties

A short survey consisting of 6 Likert-type items with five response options was simulated in order to represent a single scale measuring a single construct. There were three conditions for the number of simulated respondents in this study: 150, 500, and 1000. In order to control for sampling error, one hundred samples were simulated for each condition in the study. In a given scale, it is possible that any number of individual items may show a lack of ME/I. In this study, either 2 or 4 items were simulated to exhibit a lack of ME/I across situations (referred to as DIF items). In order to manipulate the amount of DIF present, Group 1 item parameters were simulated, and these item parameters were then changed in various ways in order to simulate DIF in the Group 2 data. A random normal distribution, N[µ = -1.7, σ = .45], was sampled in order to generate the b parameter values for the lowest BRF for the Group 1 data. Constants of 1.2, 2.4, and 3.6 were then added to the lowest threshold in order to generate the threshold parameters of the other three BRFs necessary to generate Likert-type data with five response categories. The a parameter for each Group 1 item was also sampled from a random normal distribution, N[µ = 1.25, σ = .07]. All data were generated using the GENIRV item response generator (Baker, 1994).
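A minimal sketch of the parameter-generation scheme just described follows (GENIRV itself is not reproduced; the code and variable names are ours).

```python
import numpy as np

rng = np.random.default_rng(42)
n_items = 6

# Group 1 generating parameters, following the scheme described above:
# lowest threshold ~ N(-1.7, .45); remaining thresholds sit 1.2, 2.4, and 3.6 above it;
# discrimination a ~ N(1.25, .07).
lowest_b = rng.normal(-1.7, 0.45, size=n_items)
b_group1 = lowest_b[:, None] + np.array([0.0, 1.2, 2.4, 3.6])   # 6 items x 4 thresholds
a_group1 = rng.normal(1.25, 0.07, size=n_items)

print(a_group1.round(2))
print(b_group1.round(2))
```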
Simulating Differences in the Data

DIF was simulated by subtracting .25 from a Group 1 item's a parameter for each DIF item in order to create the Group 2 DIF items' a parameters. Though DIF items' b parameters varied in several ways (discussed below), the overall magnitude of the variation was the same for each condition. Specifically, for each DIF item in which b parameters were varied, a value of .40 was either added to or subtracted from the Group 1 b parameters in order to create the Group 2 b parameters. This value is large enough to cause a noticeable change in the sample θ estimates derived from the item parameters, yet not so large as to cause overlap with other item parameters (which are 1.2 units apart). There were three conditions in which items' b parameters varied. In the first condition, only one b parameter differed for each DIF item. This was accomplished by adding .40 to the item's largest b value. This condition represents a case in which the most extreme option (e.g., a Likert rating of 5) is less likely to be used by Group 2 than by Group 1. This type of difference may be seen, for example, in performance ratings if the culture of one department within an organization (Group 2) is to rarely assign the highest possible performance rating while another department (Group 1) is more likely to give this highest rating. In a second condition, each DIF item's largest two b parameters were set to differ between groups. Again, this was accomplished by adding .40 to the largest two b parameters for the Group 2 DIF items. This simulated difference could again reflect generally more lenient performance ratings for one group (Group 1) than another (Group 2). In a third condition, each DIF item's two most extreme b parameters were simulated to differ between groups. This was accomplished by adding .40 to the item's largest b parameter while simultaneously subtracting the same value from the item's lowest b parameter. This situation represents the case in which persons in one group (Group 2) are less likely to use the more extreme response options (1 and 5) than persons in Group 1. Such response patterns may be seen when comparing the results of an organizational survey for a multi-national organization across cultures, as there is some evidence of differences in extreme response tendencies by culture (Clarke, 2000; Hui & Triandis, 1989; Watkins & Cheung, 1995).
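The sketch below illustrates how the DIF manipulations described above could be applied to a copy of the Group 1 parameters; it is our own illustration, and the hard-coded values are only examples.

```python
import numpy as np

# Assume a_group1 (length 6) and b_group1 (6 items x 4 thresholds) were generated
# as in the previous sketch; illustrative values are hard-coded here.
a_group1 = np.full(6, 1.25)
b_group1 = np.tile([-1.7, -0.5, 0.7, 1.9], (6, 1))

a_group2, b_group2 = a_group1.copy(), b_group1.copy()
dif_items = [0, 1]  # the 2-DIF-item conditions; the 4-DIF-item conditions use four items

# "Highest b" condition: add .40 to each DIF item's largest b parameter.
for i in dif_items:
    b_group2[i, -1] += 0.40
# Other conditions (not run here): add .40 to the two largest b's ("highest 2 b");
# add .40 to the largest and subtract .40 from the smallest b ("extremes"); or, in the
# a-DIF conditions, subtract .25 from the DIF items' a parameters instead.

print(b_group2.round(2))
```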
In order to reduce the number of comparisons, a and b parameters were not manipulated simultaneously in this study.

Data Analysis

The multi-group feature of LISREL 8.51 (Jöreskog & Sörbom, 1996), in which Group 1 and Group 2 raw data are input into LISREL separately, was used for all CFA analyses. Five models were examined in order to detect a lack of ME/I. In the first (Model 1), omnibus tests of covariance matrix equality were conducted. Second, simulated Group 1 and Group 2 data sets were analyzed simultaneously, yet item parameters (factor loadings and intercepts), the factor's variance, and latent means were allowed to vary across situations; this served as the baseline model (Model 2). Nested model chi-square difference tests were then conducted to evaluate the significance of the decrement in fit for each of the more constrained models described below. A test of differences in factor loadings between groups was conducted by examining a model identical to the baseline model except that factor loadings for like items were constrained to be equal across situations (e.g., Group 1 Item 1 = Group 2 Item 1, etc.; Model 3). Next, equality of item intercepts was tested across situations by placing similar constraints on like items (Model 4). Lastly, a model was examined in which factor variances were constrained across the Group 1 and Group 2 samples (Model 5), as is typical in ME/I tests of longitudinal data (Vandenberg & Lance, 2000). An alpha level of .05 was used for all analyses. As is customary in CFA studies of ME/I, if the CFA omnibus test of ME/I indicated that the data were not equivalent between groups, nested model analyses continued only until an analysis identified the source of the lack of ME/I. Thus, no further scale-level analyses were conducted after one specific test of ME/I was significant.

Data were also analyzed using IRT-based LR tests (Thissen et al., 1988, 1993). These analyses were performed using the multi-group function in MULTILOG (Thissen, 1991). A p value of .05 was used for all analyses. DFIT tests were conducted by first estimating item and person parameters using MULTILOG, then using DFITP6.0 (Raju et al., 1995). NCDIF indices with a cutoff score of .096 were used for assessment of item-level DIF.

One way in which CFA and IRT analyses can be directly compared is by conducting item-level tests for the CFA analyses (Flowers, Raju, & Oshima, 2002). As such, we tested the invariance of each item individually by estimating one model per item in which the factor loading for that item was constrained to be invariant across samples while the factor loadings for all other items were allowed to vary across samples. These models were each compared to the baseline model in which all factor loadings were free to vary between samples. Thus, the test for each item comprised a nested model test with one degree of freedom. One issue involved with these tests was the choice of referent item (Rensvold & Cheung, 2001). As in all other CFA models, an item known to be invariant was chosen as the referent item for all tests. Another issue in conducting item-level CFA tests is whether the tests should be nested. On one hand, it is unlikely that organizational researchers would conduct item-level tests of factor loadings unless there was evidence of some difference in factor loadings for the scale as a whole. Thus, in order to determine how likely researchers would be to accurately assess partial metric invariance, it would seem that these tests should be conducted only if both the test of equality of covariance matrices and the test of factor loadings identified some source of difference in the data sets. However, such an approach would make direct comparisons to IRT tests problematic, as IRT tests are conducted for every item with no prior tests necessary. Conversely, not nesting these tests would misrepresent the efficacy of CFA item-level tests as they are likely to be used in practice. As direct CFA and IRT comparisons were our primary interest, we chose to conduct item-level CFA tests for every item in the data set. However, we have also reported results that would be obtained if a nested approach were utilized.

Lastly, for item-level IRT and CFA tests, true positive (TP) rates were computed for each of the 100 samples in each condition by dividing the number of simulated DIF items successfully detected as DIF items by the total number of DIF items generated. False positive (FP) rates were calculated by dividing the number of non-DIF items flagged as DIF items by the total number of items simulated to contain no DIF. These TP and FP rates were then averaged across the 100 samples in each condition. True negative (TN) and false negative (FN) rates can be computed from the TP and FP rates (i.e., TN = 1.0 − FP and FN = 1.0 − TP).
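The TP and FP rates defined above can be computed per sample as in the following sketch (ours; the flagged items shown are hypothetical).

```python
def tp_fp_rates(flagged_items, true_dif_items, n_items=6):
    """True-positive and false-positive rates for one simulated sample.

    flagged_items: set of item indices a method identified as DIF items.
    true_dif_items: set of item indices simulated to contain DIF."""
    true_dif = set(true_dif_items)
    non_dif = set(range(n_items)) - true_dif
    tp = len(flagged_items & true_dif) / len(true_dif) if true_dif else float("nan")
    fp = len(flagged_items & non_dif) / len(non_dif) if non_dif else float("nan")
    return tp, fp

# Hypothetical sample: items 0 and 1 were simulated as DIF items; a method flags items 0 and 4.
print(tp_fp_rates({0, 4}, {0, 1}))  # -> (0.5, 0.25); TN = 1 - FP and FN = 1 - TP
```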
Results

Hypothesis 1

Hypothesis 1 was that data simulated to have differences in only b parameters would be detected as lacking ME/I by IRT methods, but would appear to exhibit ME/I with CFA-based tests. Six conditions in which only b parameters differed
between data sets provided the central test of this hypothesis. Table 1 presents the number of samples in which a lack of ME/I was detected by the CFA scale-level analyses. For these conditions, the CFA omnibus test of ME/I was largely inadequate at detecting a lack of ME/I. This was particularly true for sample sizes of 150, for which no more than 3 samples per condition were identified as exhibiting a lack of ME/I as indexed by the omnibus test of covariance matrices. For Conditions 6 (highest 2 b parameters) and 7 (extreme b parameters) with sample sizes of 500 and 1000, the CFA scale-level tests were better able to detect some differences between the data sets; however, the source of the differences identified by specific tests of ME/I varied in unpredictable ways, with item intercepts detected as different somewhat frequently in Condition 6 but rarely in Condition 7 (see Table 1). Note that the results in the tables indicate the number of samples in which specific tests of ME/I were significant. However, because of the nested nature of the analyses, these numbers do not reflect the percentage of analyses that were significant, as analyses were not conducted if an earlier specific test of ME/I indicated that the data lacked ME/I.

Interestingly, CFA item-level TP rates were higher for sample sizes of 150 than for 500 or 1000, though in general TP rates were very low for the CFA item-level analyses (see Table 2). When CFA item-level analyses were nested (i.e., conducted only after significant results from the omnibus test of covariance matrices and the scale-level test of factor loadings), TP rates were very poor indeed. As these nested item-level comparisons reflect the scenario under which such analyses would most likely be conducted, it appears highly unlikely that the differing items would be correctly identified. Such failures of identification would preclude establishing partial invariance (Byrne, Shavelson, & Muthen, 1989).

Investigating TP and FP rates for the LR tests reveals an entirely different scenario. LR tests were largely effective at detecting the specific items with simulated DIF (see Table 2). Across all conditions, the LR TP rates were considerably higher for the larger sample sizes than for the N = 150 sample size. FP rates were slightly higher for sample sizes of 500 and 1000 than for 150 due to the increased power of these tests. However, the FP rate was typically below .15 even for sample sizes of 1000. In sum, it appears that the LR index was very good at detecting some form of DIF between groups. However, correct identification of the source of DIF was more problematic for this index, particularly for sample sizes of 150.

Results for the DFIT tests were very poor across all conditions. In general, the DFIT tests tended to be very conservative in that they rarely identified an item as exhibiting DIF.

Hypothesis 2

Two conditions (Conditions 8 and 9) provided a test of Hypothesis 2: that data with differing a parameters will be detected as lacking ME/I by both IRT and CFA tests. For these conditions, the CFA scale-level analyses performed somewhat better, though not as well as hypothesized (see Table 3). When there were only two DIF items (Condition 8), only 25 and 26 samples (out of 100) were detected as having some difference by the CFA omnibus test of equality for sample sizes of 150 and 500, respectively, though this number was far greater for sample sizes of 1000. For those samples with differences detected by the CFA omnibus test, factor loadings were identified as the source of the simulated difference in ten samples or fewer for each of the conditions. These results were unexpected given the direct analogy and mathematical link between IRT a parameters and factor loadings in the CFA paradigm (McDonald, 1999). Item-level CFA tests also performed poorly in these conditions (see Table 4). As in the conditions discussed earlier, items in samples with differences in only a parameters were rarely classified as showing a lack of measurement invariance. Also as before, one positive finding was that FP rates remained low. Similar findings emerged for the DFIT tests with these samples. LR tests performed considerably better than the CFA and DFIT tests for these data, though correct identification of DIF items was lower than desired for sample sizes of 150 and 500. However, despite the disappointing performance for smaller sample sizes, LR tests still outperformed the other tests in these conditions.

Discussion

In this study it was shown that, as expected, CFA methods of establishing ME/I were inadequate for detecting items with differences in only b parameters. However, contrary to expectations, the CFA methods were also largely inadequate at detecting differences in item a parameters. These latter results call into question the seemingly clear analogy between IRT a parameters and factor loadings in a CFA framework. IRT-based LR tests were somewhat better suited for detecting differences when they were known to exist. However, the LR tests also had a somewhat low TP rate for sample sizes of 150. This is somewhat to be expected, as a sample size of 150 (per group) is extremely small by IRT standards. Examination of the MULTILOG output revealed that the standard errors associated with parameter estimates
with this sample size are somewhat larger than would be desired. However, we felt it was important to include this sample size as it is commonly found in ME/I studies and in organizational research in general.

It is important to note that when there were no differences simulated between the Group 1 and Group 2 data sets (Condition 1), the LR tests nonetheless detected some lack of ME/I in a number of samples (see Tables 2 and 4). This result for the LR test is similar to the inflated FP rate reported by McLaughlin and Drasgow (1987) for Lord's (1980) chi-square index of DIF (of which the LR test is a generalization). The inflated Type I error rate indicates a need to choose an appropriately adjusted alpha level for LR tests in practice. In this study, an alpha level of .05 was used to test each item in the scale with the LR test. As a result, the condition-wise error rate for the LR AnyDIF index was .30 (with 6 items), which led to an inflated LR FP rate.

DFIT analyses fared very poorly in our analyses. By their nature, DFIT tests tend to be more conservative and thus classify few items as DIF items. In this study, relatively small amounts of DIF were simulated for either 2 or 4 items. Apparently, this small amount of simulated difference was not sufficient to cause the NCDIF index to flag many items as DIF items.

Implications

While several authors have compared and contrasted IRT and CFA ME/I approaches, it is important to emphasize that the two methods (1) rest on different assumptions (e.g., linearity versus non-linearity), (2) provide different ME/I information, and (3) are each imperfect. As such, researchers and practitioners may receive an incomplete picture of the psychometric properties of a measure by using only one methodology. Importantly, among IRT approaches there is wide variability both in assumptions regarding the nature and importance of DIF and in the mathematical procedures used to calculate DIF indices and tests. As this study makes clear, these methods can lead to very different conclusions regarding the same data. Moreover, in some situations ME/I tests can be misleading, such as CFA tests when sample sizes are small, when factor communalities are low (Meade & Lautenschlager, 2004), or when differences that parallel those of b parameters are present. IRT analyses were also sometimes misleading with small sample sizes (typical of much organizational research), and they provide no information about relationships between latent factors, which could be important in some situations. Additionally, results from the DFIT tests more closely mirrored those of the CFA tests than the LR tests in that they were very insensitive to the minor amounts of DIF simulated in this study.

The three ME/I tests clearly provide different information regarding the equivalence of measures across conditions. As such, we suggest that researchers conduct multiple ME/I tests using both IRT and CFA methods whenever feasible. Examples of the use of both IRT and CFA methodologies are available in the existing literature (cf. Schmit, Kihm, & Robie, 2000; see also Facteau & Craig, 2001; Maurer et al., 1998; Zickar & Robie, 1999), and each can provide unique information that could be useful in both research and organizational decision making. However, we recognize that performing three types of analyses may not be feasible in all situations due to time and resource constraints. In such cases, it is important that researchers and practitioners fully understand the advantages and disadvantages of the two methods. As only the CFA analyses provide information regarding the relationships between latent factors, their use would be preferable when the research goal is to examine the equivalence of a multi-factorial framework (e.g., the five-factor model of personality, organizational culture perceptions, etc.). When the equivalence of a single scale or of specific scale items is of interest, IRT LR analyses are more desirable. With the additional parameters estimated via IRT methods, more information is present on which ME/I tests can be conducted. These additional parameters (b parameters in the GRM) provide considerably more psychometric information at the item response level than do their CFA counterparts. However, IRT analyses generally require larger sample sizes in order to adequately estimate these additional parameters, and they require at least a moderate number of scale items in order to adequately estimate both latent trait scores and item parameters (Embretson & Reise, 2000). In addition, the pair-wise nature of IRT LR tests would make them particularly cumbersome when ME/I needs to be established across several groups.

One thing that is clear from this study is that these three tests vary considerably in their sensitivity to the small to moderate differences present in these data. As such, the goals of the ME/I test should be evaluated prior to choosing an analytic method. If ME/I is being established across departments of an organization for an organizational climate survey, then perhaps CFA or DFIT methods would be more appropriate, as they are more likely to disregard moderate differences in the properties of test items. Minor differences in a limited number of individual items may not be cause for concern in this case. However, if evaluating the comparability of an
employment selection test across racial groups, LR tests may be more appropriate, as increased sensitivity is of paramount importance in that case.

Under ideal conditions, it would be desirable to consider both approaches when examining ME/I. First, measurement equivalence could be examined using IRT methods at the item level within each scale or subscale of interest. Items that satisfy these conditions could then be used in CFA tests for individual scales and in more complex measurement models involving several scales simultaneously. When item parceling is necessary, it would be optimal to use IRT results both to form item parcels based on the psychometric properties of the items (i.e., item a and b parameters) and to ensure item-level ME/I before conducting CFA analyses.

Finally, a broader implication of these analyses is the need for a reconceptualization of the purpose of ME/I tests. While both CFA methods and LR tests employ parametric statistical tests of model fit to evaluate equivalence across groups, perhaps these tests are best not thought of in the same way that researchers traditionally think of statistical tests. While these tests are technically all-or-nothing statistical tests, they may be more practically regarded as pieces of information regarding the equivalence of measures. With each ME/I test that indicates equivalence across conditions, researchers and practitioners can be more assured that their measures are behaving equivalently. If DFIT, LR, and CFA tests all indicate invariance across conditions, a high degree of certainty could be placed on cross-condition comparisons. If only one method indicated ME/I, conclusions based on cross-condition comparisons should be tempered. An all-or-nothing approach to measurement equivalence does not seem warranted when different statistical tests can provide substantially different answers to ostensibly the same question.

Limitations

As with most simulation studies, this study is limited in scope. There are many possible properties of the data used by researchers and practitioners when conducting tests of ME/I. We simulated only a small number of these data properties as an exploratory study into the efficacy of CFA and IRT tests of ME/I. Thus, although these analyses explicate some areas of concern regarding ME/I tests, they by no means represent the vast array of situations encountered by researchers conducting such tests. A second limitation could be that IRT-based software (GENIRV) was used to create the data on which both the CFA and IRT analyses were conducted. It may seem that if data were created using an IRT framework, IRT analyses might have some advantages in data analysis. We acknowledge that the IRT software creates item-to-latent-trait relationships that are not linear. However, only through IRT models could we simulate an optimal operationalization of a subtle lack of ME/I via differences in b parameters. Furthermore, the CFA analyses in this study encountered very few problems with estimation of the models, and in general model fit was typical of that encountered in many ME/I studies involving actual data. Also, as Raju et al. (2002) state, the item-to-trait relationship modeled (and created) in IRT is at least as likely, if not more likely, to hold true for non-simulated data as is the linear relationship modeled by CFA analyses.

Future Research

While this study sheds some light on the nature of the differences between IRT and CFA ME/I methods, several areas of future research are strongly needed. Many researchers using CFA methods create item bundles or parcels in order to reduce the number of parameters estimated during analysis. This practice may have large implications for the detection of ME/I in the CFA approach. Specifically, we believe that our results indicate that parceling items may further cloud the outcome of CFA ME/I analyses. Obviously, as fewer item parameters are estimated, less information is available that can be tested for ME/I. Moreover, we contend that individual item responses lead to a lack of ME/I, so investigations of ME/I should be focused at the individual item level when possible. However, it may be useful for future simulation work to extend the present research in this direction.

Vandenberg (2002) and Riordan et al. (2001) have called for Monte Carlo studies into the properties of CFA tests of ME/I. In addition, Raju et al. (2002) recently called for exactly this type of study in order to further our understanding of tests of ME/I (and DIF) and the relationship between the CFA and IRT methods. Though the findings of this study, and the conceptual arguments for why IRT analyses might be expected to outperform CFA analyses in some situations, begin to highlight real differences between the methodologies, much further work is needed. We reiterate Raju et al.'s (2002) call for large simulation studies investigating the situations in which CFA or IRT might be preferable. Possible conditions to examine include different assumptions concerning the data (such as the degree of robustness to violations of normality), different sample sizes typically encountered in organizational research, different numbers of scale items, and different amounts of dimensionality present in the data. We hope that this study provides a first of many steps toward establishing the conditions under which IRT versus CFA analyses are more suitable for establishing ME/I.
References

Baker, F. (1994). GENIRV: Computer program for generating item response theory data. Madison: University of Wisconsin, Laboratory of Experimental Design.

Byrne, B. M., Shavelson, R. J., & Muthen, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456-466.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

Clarke, I., III. (2000). Extreme response style in cross-cultural research: An empirical investigation. Journal of Social Behavior and Personality, 15, 137-152.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86(2), 215-227.

Flowers, C. P., Raju, N. S., & Oshima, T. C. (2002, April). A comparison of measurement equivalence methods based on confirmatory factor analysis and item response theory. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18(3-4), 117-144.

Hui, C. H., & Triandis, H. C. (1989). Effects of culture and response format on extreme response style. Journal of Cross Cultural Psychology, 20, 296-309.

Jöreskog, K., & Sörbom, D. (1996). LISREL 8: User's reference guide. Chicago: Scientific Software International.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate performance appraisal measurement equivalence. Journal of Applied Psychology, 83(5), 693-702.

McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.

McLaughlin, M. E., & Drasgow, F. (1987). Lord's chi-square test of item bias with estimated and known person parameters. Applied Psychological Measurement, 21, 161-173.

Meade, A. W., & Lautenschlager, G. J. (2003). A comparison of IRT and CFA methodologies for establishing measurement equivalence with simulated data. Paper presented at the 18th Annual Conference of the Society for Industrial and Organizational Psychology, Orlando, FL.

Meade, A. W., & Lautenschlager, G. J. (2004). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11(1), 60-72.

Raju, N. S., Laffitte, L. J., & Byrne, B. M. (2002). Measurement equivalence: A comparison of methods based on confirmatory factor analysis and item response theory. Journal of Applied Psychology, 87(3), 517-529.

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353-368.

Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114(3), 552-566.

Rensvold, R. B., & Cheung, G. W. (2001). Testing for metric invariance using structural equation models: Solving the standardization problem. In C. A. Schriesheim & L. L. Neider (Eds.), Research in management (Vol. 1): Equivalence in measurement (pp. 21-50). Greenwich, CT: Information Age.

Riordan, C. M., Richardson, H. A., Schaffer, B. S., & Vandenberg, R. J. (2001). Alpha, beta, and gamma change: A review of past research with recommendations for new directions. In C. A. Schriesheim & L. L. Neider (Eds.), Equivalence of measurement. Greenwich, CT: Information Age.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, No. 17.

Schmit, M. J., Kihm, J. A., & Robie, C. (2000). Development of a global measure of personality. Personnel Psychology, 53(1), 153-193.

Thissen, D. (1991). MULTILOG user's guide: Multiple categorical item analysis and test scoring using item response theory [Computer program]. Chicago: Scientific Software International.
Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Erlbaum.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Erlbaum.

Vandenberg, R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5(2), 139-158.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-69.

Watkins, D., & Cheung, S. (1995). Culture, gender, and response bias: An analysis of responses to the Self-Description Questionnaire. Journal of Cross Cultural Psychology, 26, 490-504.

Zickar, M. J., & Robie, C. (1999). Modeling faking good on personality items: An item-level analysis. Journal of Applied Psychology, 84, 551-563.

Author Contact Information:

Adam W. Meade
Department of Psychology
Campus Box 7801
North Carolina State University
Raleigh, NC 27695-7801
Ph: 919.513.4857
Fx: 919.515.1716
e-mail: [email protected]
web: http://www4.ncsu.edu/~awmeade

Gary J. Lautenschlager
Department of Psychology
University of Georgia
Athens, GA 30602-3013
Ph: 706.542.3054
e-mail: [email protected]
Table 1. Number of samples (out of 100) in each condition in which there was a significant lack of ME/I for data with only differing b parameters: CFA scale-level analyses.

Cond.  Description                                  N      Σ     Λx    Τx    Φ
1      0 DIF Items, No A DIF, No B DIF              150    1     1     0     0
                                                    500    2     0     0     1
                                                    1000   0     0     0     0
2      2 DIF Items, No A DIF, Highest B DIF         150    2     2     0     0
                                                    500    4     0     0     1
                                                    1000   1     0     0     0
3      2 DIF Items, No A DIF, Highest 2 B DIF       150    0     0     0     0
                                                    500    21    2     17    1
                                                    1000   9     0     3     0
4      2 DIF Items, No A DIF, Extremes B DIF        150    3     3     0     0
                                                    500    33    3     3     15
                                                    1000   20    1     0     4
5      4 DIF Items, No A DIF, Highest B DIF         150    1     0     0     0
                                                    500    7     1     2     3
                                                    1000   9     2     0     0
6      4 DIF Items, No A DIF, Highest 2 B DIF       150    1     0     0     0
                                                    500    61    2     35    16
                                                    1000   90    7     10    32
7      4 DIF Items, No A DIF, Extremes B DIF        150    2     0     0     0
                                                    500    80    3     6     43
                                                    1000   100   3     0     49
Note: Items simulated to be DIF items in bold. Σ=null test of equal covariance matrices, Λx = test of equal factor loadings, Τx = test of equal item intercepts, Φ = test of equal factor variances. CFA scale-level analyses are fully nested (e.g. no Τx test if test of Λx is significant).
Table 2. True and False Positive Rates for Item-Level Tests

                 DFIT           CFA (nested)     CFA (not nested)   LR
N                TP     FP      TP     FP        TP     FP          TP     FP

Condition 1: 0 DIF Items, No A DIF, No B DIF
150              --     0.00    --     0.00      --     0.06        --     0.06
500              --     0.00    --     0.00      --     0.05        --     0.05
1000             --     0.00    --     0.00      --     0.02        --     0.05

Condition 2: 2 DIF Items, No A DIF, Highest B DIF
150              0.01   0.02    0.01   0.00      0.10   0.05        0.18   0.03
500              0.00   0.00    0.00   0.00      0.03   0.07        0.45   0.07
1000             0.00   0.00    0.00   0.00      0.02   0.04        0.60   0.09

Condition 3: 2 DIF Items, No A DIF, Highest 2 B DIF
150              0.02   0.00    0.00   0.00      0.13   0.03        0.33   0.04
500              0.00   0.00    0.00   0.01      0.02   0.06        0.91   0.14
1000             0.00   0.00    0.00   0.00      0.03   0.03        0.91   0.13

Condition 4: 2 DIF Items, No A DIF, Extremes B DIF
150              0.01   0.01    0.02   0.01      0.18   0.07        0.45   0.05
500              0.00   0.00    0.00   0.01      0.01   0.05        0.78   0.06
1000             0.00   0.00    0.01   0.01      0.05   0.05        1.00   0.08

Condition 5: 4 DIF Items, No A DIF, Highest B DIF
150              0.01   0.00    0.00   0.00      0.05   0.04        0.16   0.01
500              0.00   0.00    0.00   0.00      0.04   0.04        0.59   0.09
1000             0.00   0.00    0.00   0.00      0.01   0.00        0.66   0.08

Condition 6: 4 DIF Items, No A DIF, Highest 2 B DIF
150              0.01   0.00    0.00   0.00      0.04   0.06        0.15   0.02
500              0.00   0.00    0.01   0.00      0.02   0.05        0.82   0.27
1000             0.00   0.00    0.01   0.01      0.04   0.01        0.92   0.14

Condition 7: 4 DIF Items, No A DIF, Extremes B DIF
150              0.01   0.02    0.00   0.00      0.08   0.11        0.37   0.02
500              0.00   0.00    0.02   0.01      0.03   0.04        0.84   0.06
1000             0.00   0.00    0.01   0.00      0.03   0.02        1.00   0.10

Note: TP = true positive rate; FP = false positive rate. Dashes indicate that TP rates are undefined for Condition 1, in which no DIF items were simulated.
Table 3. Number of samples (out of 100) in each condition in which there was a significant lack of ME/I for data with only differing a parameters: CFA scale-level analyses.

Cond.  Description                          N      Σ     Λx    Τx    Φ
1      0 DIF Items, No A DIF, No B DIF      150    1     1     0     0
                                            500    2     0     0     1
                                            1000   0     0     0     0
8      2 DIF Items, A DIF, No B DIF         150    25    10    0     0
                                            500    26    1     6     13
                                            1000   94    3     0     11
9      4 DIF Items, A DIF, No B DIF         150    29    6     1     7
                                            500    91    6     8     48
                                            1000   100   5     2     39
Note: Items simulated to be DIF items in bold. Σ=null test of equal covariance matrices, Λx = test of equal factor loadings, Τx = test of equal item intercepts, Φ = test of equal factor variances. CFA scale-level analyses are fully nested (e.g. no Τx test if test of Λx is significant).
Table 4. True and False Positive Rates for Item-Level Analyses

                 DFIT           CFA (nested)     CFA (not nested)   LR
N                TP     FP      TP     FP        TP     FP          TP     FP

Condition 1: 0 DIF Items, No A DIF, No B DIF
150              --     0.00    --     0.00      --     0.06        --     0.06
500              --     0.00    --     0.00      --     0.05        --     0.05
1000             --     0.00    --     0.00      --     0.02        --     0.05

Condition 8: 2 DIF Items, A DIF, No B DIF
150              0.02   0.01    0.03   0.01      0.12   0.04        0.54   0.03
500              0.00   0.00    0.00   0.00      0.02   0.02        0.50   0.08
1000             0.00   0.00    0.01   0.02      0.04   0.04        1.00   0.09

Condition 9: 4 DIF Items, A DIF, No B DIF
150              0.02   0.03    0.02   0.02      0.09   0.12        0.35   0.04
500              0.00   0.00    0.01   0.00      0.04   0.03        0.66   0.05
1000             0.00   0.00    0.02   0.01      0.05   0.05        0.99   0.09

Note: TP = true positive rate; FP = false positive rate. Dashes indicate that TP rates are undefined for Condition 1, in which no DIF items were simulated.
Figure 1. ICC for Dichotomous Item
[Figure not reproduced: item characteristic curve, with P(theta) plotted against theta from -4.0 to 4.0.]
Figure 2. BRFs for Polytomous (Likert) Item.
[Figure not reproduced: boundary response functions, with P(theta) plotted against theta from -4.0 to 4.0.]
Figure 3. Graphical representation of CFA item intercepts. Adapted from Rensvold & Cheung (2001).
[Path diagram not reproduced: it labels an observed item Xi with intercept τi and loading λi1 on the latent factor ξ1, with latent mean κ1.]
Figure 4. BRFs of Item Exhibiting a Parameter DIF. Group 2 BRFs in Dashed Line.
[Figure not reproduced: BRFs for both groups, with P(theta) plotted against theta from -4.0 to 4.0.]
Figure 5. BRFs of Item Exhibiting b Parameter DIF. Group 2 BRFs in Dashed Line.
[Figure not reproduced: BRFs for both groups, with P(theta) plotted against theta from -4.0 to 4.0.]