Meade, A. W., Johnson, E. C., & Braddy, P. W. (2006, August). The Utility of Alternative Fit Indices in Tests of Measurement Invariance. Paper presented at the annual Academy of Management conference, Atlanta, GA.

The Utility of Alternative Fit Indices in Tests of Measurement Invariance

Adam W. Meade
Emily C. Johnson
Phillip W. Braddy
North Carolina State University

Confirmatory factor analytic tests of measurement invariance (MI) based on the chi-square statistic are known to be sensitive to sample size. For this reason, Cheung and Rensvold (2002) recommended using alternative fit indices in MI investigations. However, previous studies have not established the power of fit indices to detect data with a lack of invariance. In this study, we investigated the performance of fit indices with simulated data known not to be invariant. Our results indicate that alternative fit indices can be successfully used in MI investigations. Specifically, we suggest reporting McDonald's noncentrality index along with CFI and Gamma-hat.

Measurement invariance (MI) can be considered the degree to which measurements conducted under different conditions yield measures of the same attributes (Drasgow, 1984; Horn & McArdle, 1992). These different conditions include stability of measurement over time (Chan, 1998; Chan & Schmitt, 2000), across different populations (e.g., cultures, Riordan & Vandenberg, 1994; gender, Marsh, 1985, 1987; age groups, Marsh & Hocevar, 1985), across rater groups (e.g., Facteau & Craig, 2001), and across different media of measurement administration (Chan & Schmitt, 1997; Ployhart, Weekley, Holtz, & Kemp, 2003). Recently, there has been a substantial increase in research involving tests of MI, due in part to an increased awareness of the importance of comparing equivalent measures as well as increased access to, and understanding of, the methodology used to perform tests of MI (Meade & Lautenschlager, 2004; Vandenberg, 2002). Though multiple methods of establishing MI exist, multiple-group confirmatory factor analysis (CFA) has been the most commonly used method in organizational research (Vandenberg & Lance, 2000). With these tests, constrained and free CFA models are typically compared using a chi-square-based likelihood ratio test (LRT; sometimes called a chi-square difference test). However, like chi-square tests of overall model fit, the LRT has been shown to be sensitive to sample size (Brannick, 1995; Kelloway, 1995; Meade & Lautenschlager, 2004).
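To make the comparison concrete, the following minimal sketch computes the LRT from the chi-square statistics and degrees of freedom of a constrained and a baseline (free) model. The fit values shown are hypothetical and the function name is ours; the test simply refers the difference in chi-square to a chi-square distribution with the difference in degrees of freedom.

```python
from scipy.stats import chi2

def likelihood_ratio_test(chisq_constrained, df_constrained,
                          chisq_baseline, df_baseline, alpha=0.05):
    """Chi-square difference (LRT) test between nested multi-group CFA models."""
    delta_chisq = chisq_constrained - chisq_baseline
    delta_df = df_constrained - df_baseline
    p_value = chi2.sf(delta_chisq, delta_df)  # survival function = 1 - CDF
    return delta_chisq, delta_df, p_value, p_value < alpha

# Hypothetical fit statistics for a configural (baseline) and a loading-constrained model
dchi, ddf, p, reject = likelihood_ratio_test(240.3, 222, 198.7, 206)
print(f"Delta chi-square = {dchi:.1f} on {ddf} df, p = {p:.4f}, LOI flagged: {reject}")
```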

Thus, in large samples, power to detect even trivial differences in the properties of a measure between groups is extremely high, potentially leading to overidentification of a lack of invariance (LOI). For this reason, Cheung and Rensvold (2002) examined the potential use of changes in alternative fit indices in MI investigations. As with overall model fit, these alternative fit indices (AFIs) are less strongly affected by sample size than is chi-square in measurement invariance tests. While their groundbreaking work is extremely promising, one crucial omission from Cheung and Rensvold's study is that they only examined the performance of AFIs under the null hypothesis of perfect MI between groups. Thus, while they recommended the use of some AFIs, the power of these indices to detect a lack of invariance between groups is unknown. This deficiency in the literature precludes more widespread use of AFIs in MI studies, as researchers have no indication that AFIs are sensitive to an LOI. One reason for this omission from their study is that no standard measure or amount of effect size has been established in MI research; thus, it is difficult to justify simulating any one level of LOI between groups. This study overcomes this limitation by generating data with many levels of a lack of invariance (from trivial to severe) in order to examine the performance of AFIs in MI tests of equal factor loadings.

CFA Tests of MI

Measurement invariance can be technically defined in terms of probabilities, such that in order for MI to exist, the probability of observed responses conditioned upon latent scores must be unaffected by group membership (Meredith & Millsap, 1992; Millsap, 1995). Commonly used CFA tests of MI involve simultaneously fitting a measurement model to two or more data samples. The multi-group CFA measurement model relating p observed variables to m latent factors is given by the equation:

X_g = τ_g + Λ_g ξ_g + δ_g,   (1)

where X_g is a p x 1 vector of observed scores, τ_g is a p x 1 vector of intercepts, Λ_g is a p x m matrix of factor loadings, ξ_g is an m x 1 vector of latent variable scores, δ_g is a p x 1 vector of unique factor scores, and the subscript g denotes that these parameters are group specific. Observed variable covariances are then given as:

Σ_g = Λ_g Φ_g Λ_g' + Θ_g,   (2)

where Σ_g is a p x p matrix of observed score covariances, Φ_g is an m x m latent variance/covariance matrix, and Θ_g is a p x p diagonal matrix of unique variances. MI can therefore exist for multiple parts of the CFA model. For instance, if Λ_g = Λ for all groups, metric invariance is said to exist (Horn & McArdle, 1992); if τ_g = τ for all groups, scalar invariance is indicated (Meredith, 1993); and if Θ_g = Θ for all groups, uniqueness invariance exists. If all three types of invariance are found, strict factorial invariance is indicated (Meredith, 1993), such that differences in observed score means or covariances are a product of differences in latent means (sometimes called impact; Holland & Wainer, 1993) or latent covariances.

Typically, when conducting CFA MI tests, a sequence of nested multi-group models is examined in order to detect an LOI across samples. In the first model, both data sets (representing groups, time periods, etc.) are examined simultaneously, holding only the pattern of factor loadings invariant. This model serves two functions. First, it serves as a test of configural invariance (Horn & McArdle, 1992); that is, poor fit of this model indicates either that the same factor structure does not hold for the two samples or that the model is misspecified in one or both samples. Second, the configural invariance model serves as a baseline of model fit for comparison to other, more restrictive models. Once adequate fit is established for this model, tests of equality of parameters in the CFA model are conducted in a series of sequential models in which factor loadings, intercepts, uniqueness terms, or other model parameters are typically constrained in sequence.
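To illustrate equation (2), the brief numpy sketch below builds a single group's model-implied covariance matrix from its loading, latent covariance, and uniqueness matrices. The parameter values are illustrative only (they are not the values used in this study), and uniquenesses are chosen so that item variances equal one.

```python
import numpy as np

# Illustrative parameters for one group: p = 4 items, m = 2 factors
Lambda = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.0, 0.6],
                   [0.0, 0.5]])                       # p x m factor loadings
Phi = np.array([[1.0, 0.3],
                [0.3, 1.0]])                          # m x m latent variances/covariances
Theta = np.diag(1.0 - np.sum(Lambda**2, axis=1))      # unique variances giving unit item variances

# Equation (2): Sigma_g = Lambda_g Phi_g Lambda_g' + Theta_g
Sigma = Lambda @ Phi @ Lambda.T + Theta
print(np.round(Sigma, 3))
```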


Once a statistically significant decrement in model fit is witnessed, an LOI is indicated in those parameters most recently constrained (see Vandenberg & Lance, 2000, for a review). While there is some disagreement as to how many model parameters must be equal before MI is established, the most commonly investigated portion of the MI model is the test of equality of factor loadings (Vandenberg & Lance, 2000). Moreover, factor loadings and item intercepts are generally considered the aspects of the model most essential for establishing MI (Meade & Kroustalis, in press). For this reason, we focused on MI tests of factor loadings (metric invariance) for this initial investigation.

Alternative Fit Indices (AFIs)

We could locate only one published study that has simulated data in order to determine the feasibility of using differences in AFIs to establish measurement invariance. In this study, Cheung and Rensvold (2002) achieved several important goals. First, they specified three criteria desirable in an AFI used for establishing MI: (1) independence between overall fit of the baseline model and the change in the AFI produced by the imposed model constraints (ΔAFI), (2) insensitivity to model complexity, and (3) a lack of redundancy with other AFIs. The first of these criteria is important because the degree to which sampling error is present in the data should influence the baseline and constrained models to the same extent; the extent to which this is true will be manifest via a lack of correlation between the initial AFI value and the ΔAFI associated with the additional constraints on the model. Cheung and Rensvold investigated the performance of twenty AFIs with regard to these three criteria. These twenty included χ², χ²/df (Wheaton, Muthen, Alwin, & Summers, 1977), the Root Mean Squared Error of Approximation (RMSEA; Steiger, 1989), the Noncentrality Parameter (NCP; Steiger, Shapiro, & Browne, 1985), Akaike's Information Criterion (AIC; Akaike, 1987), Browne and Cudeck's Criterion (1989), the Expected Cross-Validation Index (ECVI; Browne & Cudeck, 1993), the Normed Fit Index (NFI; Bentler & Bonett, 1980), the Relative Fit Index (RFI; Bollen, 1986), the Incremental Fit Index (IFI; Bollen, 1989), the Tucker-Lewis Index (TLI; Tucker & Lewis, 1973), the Comparative Fit Index (CFI; Bentler, 1990), the Relative Noncentrality Index (RNI; McDonald & Marsh, 1990), the Parsimony-Adjusted NFI (James, Mulaik, & Brett, 1982), the Parsimonious CFI (Arbuckle & Wothke, 1999), Gamma-hat (Steiger, 1989), the rescaled AIC (Cudeck & Browne, 1983), the Cross-Validation Index (CVI; Browne & Cudeck, 1989), McDonald's (1989) Noncentrality Index, and Critical N (Hoelter, 1983).
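Several of these indices can be computed directly from a model's chi-square, degrees of freedom, and sample size. As a rough illustration, the sketch below computes CFI, Gamma-hat, and McDonald's Noncentrality Index for a baseline and a constrained model and then takes their differences. The fit statistics are hypothetical, the function names are ours, and the Gamma-hat and McDonald's NCI formulas follow one common formulation (divisor conventions of N versus N - 1 differ across sources), so this is a sketch rather than a definitive implementation.

```python
import math

def cfi(chisq, df, chisq_null, df_null):
    """Comparative Fit Index (Bentler, 1990), relative to the independence (null) model."""
    d_model = max(chisq - df, 0.0)
    d_null = max(chisq_null - df_null, d_model)
    return 1.0 if d_null == 0.0 else 1.0 - d_model / d_null

def gamma_hat(chisq, df, n, p):
    """Gamma-hat (Steiger, 1989); p = number of observed variables."""
    d = max(chisq - df, 0.0) / n          # divisor may be n - 1 in some sources
    return p / (p + 2.0 * d)

def mcdonald_nci(chisq, df, n):
    """McDonald's (1989) Noncentrality Index."""
    return math.exp(-0.5 * max(chisq - df, 0.0) / n)

# Hypothetical statistics: configural (baseline) vs. loading-constrained model,
# N = 400 total, p = 16 observed variables, plus an independence (null) model.
base, constr = dict(chisq=198.7, df=206), dict(chisq=240.3, df=222)
null_chisq, null_df, n, p = 3200.0, 240, 400, 16

delta_cfi = cfi(**base, chisq_null=null_chisq, df_null=null_df) - \
            cfi(**constr, chisq_null=null_chisq, df_null=null_df)
delta_gh  = gamma_hat(**base, n=n, p=p) - gamma_hat(**constr, n=n, p=p)
delta_nci = mcdonald_nci(**base, n=n) - mcdonald_nci(**constr, n=n)
print(round(delta_cfi, 4), round(delta_gh, 4), round(delta_nci, 4))
```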

In order to assess the AFIs, Cheung and Rensvold (2002) simulated data under a variety of conditions, varying the number of factors, factor variances, correlations between factors, number of items per factor, factor loadings, and sample size. Importantly, they only simulated data that had no LOI in the population. They then conducted ANOVAs in order to determine the effect of the number of items, the number of factors, and their interaction on the ΔAFIs. Of the AFIs, only RMSEA was immune to all simulated factors. They also examined the correlation between the initial AFI value and the ΔAFI. Using this criterion, only NCP, IFI, CFI, RNI, Gamma-hat, McDonald's NCI, and Critical N showed nonsignificant correlations. Moreover, using a six-way ANOVA, they found that of the indices mentioned above, only NCP and Critical N showed a dependence on sample size accounting for more than 5% of the variance in the change in the fit index. Given their results, they suggested reporting only CFI, Gamma-hat, and McDonald's NCI, as IFI and RNI correlated extremely highly with CFI.

The Current Study

In this study, we expand on the work of Cheung and Rensvold (2002) by assessing the utility of differences in AFIs (ΔAFIs) for detecting a lack of MI in item factor loadings. In order to achieve this goal, we simulated data under a constant factor model in two groups. Several conditions of sample size and differential functioning (DF) of item factor loadings between groups were then simulated.

METHOD

In order to evaluate the performance of ΔAFIs for detecting an LOI, we simulated item-level data for one group, then modified the properties of these data in several ways for some items (our DF items) in order to simulate item-level data for another hypothetical group. We decided to investigate the potential of ΔAFIs for detecting an LOI in factor loadings. While there is some consensus that tests of item intercepts are also necessary for establishing MI, tests of factor loadings always occur before tests of item intercepts (Vandenberg & Lance, 2000) and thus seemed a good starting point in this initial investigation of the feasibility of these indices for evaluating MI.

Initial Data Properties

An initial structural model was developed for two correlated eight-item scales representing "Group 1."


Several conditions of "Group 2" data were created by modifying Group 1 data to simulate DF in factor loadings for some items. Group 1 item intercepts were set at zero for all data, and uniqueness terms were created so that item variances were equal to unity. Moreover, a population correlation of .3 between the latent factors was constant across all study conditions (cf. Cheung & Rensvold, 2002). Factor loadings for Group 1 and Group 2 can be seen in Table 1. Once population data were specified, sampling error was introduced into the simulated sample data: 300 sample replications, each containing sampling error, were generated for each of the study conditions.

-----------------------------------
Insert Table 1 about here
-----------------------------------

The study design constituted a 5 (sample size) x 20 (magnitude of DF) fully crossed design. Sample sizes from 100 to 500 were simulated in increments of 100 for both Group 1 and Group 2 data; sample sizes were always equal in Group 1 and Group 2 MI comparisons. We simulated DF for 4 of 16 items, with two DF items per factor. The amount of DF in item factor loadings varied from a between-group difference of .02 to one of .40, in increments of .02. These differences in factor loadings were created by subtracting the amount of DF from the Group 1 factor loading in order to create the Group 2 factor loading for the items indicated as DF in Table 1. The magnitude of DF across the DF items was uniform within each condition.

Analyses

A CFA baseline model was estimated in which the correct factor structure (see Table 1) was specified for both Group 1 and Group 2. Next, a constrained model was estimated in which the entire factor loading matrix was constrained to be equal for the Group 1 and Group 2 data. Correlation matrices were analyzed, and factor variances were standardized in order to achieve model identification for all conditions. Results from models with standardized latent variances are equal to those using referent indicators when latent variances are known to be invariant across groups. A probability value of .05 was used in computing LRTs; LISREL 8.54 (Jöreskog & Sörbom, 1996) was used for all analyses.
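As a rough sketch of how a single replication under this design could be generated, the numpy code below draws multivariate-normal item responses for two groups, subtracting a DF magnitude from the Group 2 loadings of two items per factor (items 5, 6, 11, and 12 in Table 1). The helper name is ours, and one assumption is worth flagging: uniquenesses are recomputed for Group 2 so that item variances remain 1.0, which the paper does not specify.

```python
import numpy as np

def simulate_group(loadings, factor_corr, n, rng):
    """Draw n item-response vectors implied by a two-factor CFA model with
    uniquenesses set so that each item's variance equals 1.0."""
    p, _ = loadings.shape
    phi = np.array([[1.0, factor_corr], [factor_corr, 1.0]])
    theta = np.diag(1.0 - np.sum(loadings**2, axis=1))
    sigma = loadings @ phi @ loadings.T + theta          # implied covariance matrix
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)

rng = np.random.default_rng(0)

# Group 1 loadings: two correlated factors, eight items each (see Table 1)
lam1 = np.zeros((16, 2))
lam1[:8, 0] = [.8, .7, .6, .5, .8, .7, .6, .5]
lam1[8:, 1] = [.8, .7, .6, .5, .8, .7, .6, .5]

# Group 2: subtract the DF magnitude from four items (two per factor)
df_magnitude = 0.20                                      # one of the .02 to .40 conditions
lam2 = lam1.copy()
for item, factor in [(4, 0), (5, 0), (10, 1), (11, 1)]:  # items 5, 6, 11, 12 (0-indexed)
    lam2[item, factor] -= df_magnitude

group1 = simulate_group(lam1, factor_corr=0.3, n=300, rng=rng)  # one sample-size condition
group2 = simulate_group(lam2, factor_corr=0.3, n=300, rng=rng)
```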

We also examined the change in several AFIs between the baseline and constrained models, focusing on the AFIs found to be most promising by Cheung and Rensvold (2002). Their study revealed that many AFIs had the disadvantageous property of being correlated with initial model fit; thus, we focused on those AFIs found not to have this property. Specifically, we concentrated our investigation and reporting of results on CFI, Gamma-hat, McDonald's NCI, NCP, IFI, RNI, and Critical N. We also examined RMSEA, as that index was found by Cheung and Rensvold to be independent of model complexity.

We were primarily concerned with identifying AFIs that were both (1) sensitive to the magnitude of DF and (2) not sensitive to sample size. Thus, we assessed the suitability of each AFI by conducting ANOVAs using SAS's PROC GLM. In each model, the ΔAFI was entered as the dependent variable, with sample size and magnitude of DF as predictors. We then calculated omega-squared (ω²) effect size measures for the magnitude of DF, sample size, and the interaction between the two. Optimal AFIs are identified by large ω² values for level of DF and small ω² values for both sample size and the interaction between sample size and level of DF. We also graphed the relationship between the ΔAFIs and the amount of DF simulated. Such graphs provide a visual indication of the relationship between the ΔAFIs and the amount of DF present, and they convey this information much more succinctly than a series of large tables. These graphs feature the amount of DF simulated on the x-axis and the value of the change in the fit statistic on the y-axis.

RESULTS

None of the 60,000 analyses resulted in convergence errors or inadmissible solutions. The ω² effect size estimates for level of DF, sample size, and the interaction between the two are presented in Tables 2 and 3. While the data in these tables are the same, Table 2 sorts the fit indices by the effect of DF, whereas Table 3 sorts them by the effects of sample size and of the interaction between sample size and level of DF.

------------------------------------------
Insert Tables 2 and 3 about here
------------------------------------------

As can be seen in Tables 2 and 3, all AFIs except the NCP and Critical N outperform chi-square both in being responsive to DF and in being insensitive to sample size. Because the degrees of freedom of the baseline and constrained models were the same in all study conditions, the NCP (defined as chi-square minus degrees of freedom) and chi-square had equal effect size estimates. As can be seen in the tables, no one index was superior to the others on both criteria (maximum sensitivity to DF and minimum sensitivity to sample size). Gamma-hat, McDonald's NCI, IFI, and RNI were somewhat more sensitive to DF than the other indices.
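For readers who wish to reproduce this kind of analysis, the sketch below computes ω² for each effect in a two-way ANOVA using the standard formula (SS_effect - df_effect * MS_error) / (SS_total + MS_error). The data frame here is a small synthetic stand-in for the replication-level results (column names and values are ours), not the study's actual output.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

def omega_squared(anova_table):
    """Omega-squared per effect: (SS_effect - df_effect * MS_error) / (SS_total + MS_error)."""
    ms_error = anova_table.loc['Residual', 'sum_sq'] / anova_table.loc['Residual', 'df']
    ss_total = anova_table['sum_sq'].sum()
    effects = anova_table.drop(index='Residual')
    return (effects['sum_sq'] - effects['df'] * ms_error) / (ss_total + ms_error)

# Toy stand-in for replication-level results: one row per replication, recording the
# simulated DF magnitude, the sample-size condition, and the observed change in a fit index.
rng = np.random.default_rng(1)
rows = [{'df_mag': d, 'n': n, 'delta_cfi': 0.03 * d + rng.normal(0, 0.002)}
        for d in (0.1, 0.2, 0.3) for n in (100, 300, 500) for _ in range(50)]
results = pd.DataFrame(rows)

model = ols('delta_cfi ~ C(df_mag) * C(n)', data=results).fit()
print(omega_squared(anova_lm(model, typ=2)))
```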


CFI and RMSEA showed considerably lower effects of sample size, though RMSEA showed a sizable effect due to the interaction between DF and sample size. Conversely, IFI, RNI, McDonald's NCI, and Gamma-hat showed almost no effect of the interaction between sample size and DF, but small effects of sample size. Interestingly, Critical N showed considerably worse properties than did chi-square. These patterns can be seen in Figure 1, in which the levels of the ΔAFIs are plotted by DF and sample size (chi-square is plotted for comparison). Based on these results, it appears that Gamma-hat, McDonald's NCI, IFI, RNI, and CFI are among the most promising AFIs for establishing MI.

------------------------------------------
Insert Figure 1 about here
------------------------------------------

We also examined the correlations between the ΔAFIs, as highly correlated indices provide little unique information. As can be seen in Table 4, we found that McDonald's NCI, RNI, IFI, and Gamma-hat were very highly correlated. Thus, like Cheung and Rensvold (2002), our results suggest that reporting all four of these AFIs would provide largely redundant information.

------------------------------------------
Insert Table 4 about here
------------------------------------------

In order for the ΔAFIs to be of utility to applied researchers, cutoff values need to be established so that the indices can be used in practice. Based on their simulation work, Cheung and Rensvold (2002) suggested cutoff values of .01 for ΔCFI, .001 for ΔGamma-hat, and .02 for ΔMcDonald's NCI.¹ They did not provide recommendations for the other indices because, as in this study, they found that those indices correlated so highly with these three as to provide little unique information. Based on our analyses, we concur with Cheung and Rensvold (2002) in recommending that CFI, Gamma-hat, and McDonald's NCI be reported. We therefore evaluated the cutoff scores recommended by Cheung and Rensvold by applying them to the ΔAFIs for these three indices and plotting, for each level of DF, the percentage of samples (out of 300 replications) in which an LOI was detected for these three ΔAFIs and the LRT. These plots can be seen in Figure 2.

------------------------------------------
Insert Figure 2 about here
------------------------------------------

¹ Note that Cheung and Rensvold report negative values for these indices. In this study, we calculated the ΔAFIs so as to keep ΔAFI values (generally) positive; we have therefore changed the sign of the recommended cutoff values from Cheung and Rensvold to be consistent with our coding.
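Applying these cutoffs is mechanical once the ΔAFIs have been computed. The small sketch below flags an LOI whenever a change in fit exceeds its recommended cutoff, using the sign convention described in the footnote above (larger positive changes indicate worse fit for the constrained model); the dictionary keys and example values are ours.

```python
# Cheung and Rensvold's (2002) recommended cutoffs, signed per this paper's convention
CUTOFFS = {'delta_cfi': 0.01, 'delta_gamma_hat': 0.001, 'delta_mcdonald_nci': 0.02}

def flag_loi(delta_afis, cutoffs=CUTOFFS):
    """Flag a lack of invariance whenever a change in fit exceeds its cutoff."""
    return {name: delta_afis[name] > cutoff for name, cutoff in cutoffs.items()}

# Hypothetical changes in fit between a configural and a metric-invariance model
print(flag_loi({'delta_cfi': 0.006, 'delta_gamma_hat': 0.004, 'delta_mcdonald_nci': 0.023}))
```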

As can be seen in Figure 2, the cutoff value Cheung and Rensvold suggested for ΔCFI is somewhat out of line with those of the other fit indices, in that the CFI criterion is considerably less sensitive to DF than the others. Figure 3 plots these same data, but organizes the results by fit index to allow a better visualization of the effects of sample size. As can be seen in Figure 3, none of the ΔAFIs were unaffected by sample size. This is to be expected, however, because although the mean of a fit index may not vary by sample size, its sampling distribution will still be affected (Marsh, Balla, & McDonald, 1988). In other words, when examining the fit of two models that fit equally well in the population, larger samples will be associated with less variation around the mean ΔAFI than will smaller samples, due to less sampling error. Thus, when comparing a constrained and a baseline model with a given level of DF, larger sample sizes lead to less variation in the difference between the model AFIs, and thus to a higher percentage of the 300 replications in which an LOI is deemed significant than with smaller sample sizes.

------------------------------------------
Insert Figure 3 about here
------------------------------------------

DISCUSSION

As recognition of MI as an important psychometric issue grows, an understanding of the methods used to establish the invariance of measures across groups will grow in importance. Organizational researchers working with large samples are at a disadvantage using the LRT, as this test is strongly affected by sample size, potentially leading to excessive sensitivity for detecting differences in the psychometric properties of a measure between groups. Researchers dealing with such samples have two options available: either purposely seek out smaller samples in order to reduce the power of the LRT to reject the null hypothesis of no differences between the groups, or examine MI using ΔAFIs. Seeking out smaller samples is unlikely to ever be the preferred course of action under basic sampling theory. Instead, researchers working with large samples are likely to rely increasingly on ΔAFIs to establish MI.

This study sought to expand on the earlier work of Cheung and Rensvold (2002), who examined the performance of many AFIs under the null hypothesis of perfect MI between groups. While their groundbreaking study showed promise, it did not establish whether any of the AFIs are able to detect an LOI when one exists. Thus, our study builds on their earlier work in several important ways.


First, we examined the extent to which the ΔAFIs are sensitive to DF. Second, we examined their insensitivity to sample size when an LOI exists. Third, we demonstrated the relationship between several ΔAFIs and many levels of DF for several sample size conditions. Fourth, we evaluated the power (percentage of significant analyses) of the ΔAFIs using Cheung and Rensvold's (2002) recommended cutoff values.

The results of our study largely concur with those of Cheung and Rensvold (2002) in that we found that CFI, Gamma-hat, and McDonald's NCI were among the most promising AFIs: they were (1) less sensitive to sample size than was chi-square, (2) more sensitive to DF than chi-square, and (3) generally provided non-redundant information relative to other AFIs. However, we found that Cheung and Rensvold's recommended cutoff values affected the performance of the ΔAFIs for detecting an LOI. In particular, the recommended value for ΔCFI seems excessively large. For example, when 4 of 16 items showed DF with factor loading differences of .3, power to detect this difference was below 50% in all sample size conditions with the CFI (see Figure 3). In contrast, Gamma-hat, McDonald's NCI, and the LRT all showed power near 100% for sample sizes of 200 and larger for these data. In this study, the McDonald's NCI cutoff of .02 seemed to perform best of the four indices. The LRT and Gamma-hat seemed overly sensitive to small amounts of DF.

Several limitations of this study should be noted. One limitation is that we did not include very large sample sizes (e.g., 5,000) as a condition.

At these large sample sizes, power to detect an LOI is very high with the LRT, and it may well be that researchers dealing with these large sample sizes are the most likely to pursue using a ΔAFI to evaluate MI. Another limitation is that we simulated data that were somewhat idealized compared to those simulated by Cheung and Rensvold (2002). Our factor model was simulated to be 'clean'; in other words, the population model used to derive our sample replications had zero values for all cross-loadings. While our choice of factor model was no more arbitrary than Cheung and Rensvold's (2002), the better fit associated with our model may be less likely to be encountered in practice.

We consider our study to be an initial expansion of earlier work on AFIs for evaluating MI. Future research needs to address the performance of these indices in identifying an LOI in item intercepts, uniqueness terms, factor variances and covariances, and latent means. Also, the effects of model misspecification and model complexity on ΔAFIs need to be examined under conditions in which MI does not hold. Importantly, a follow-up study that included very large sample sizes would also be valuable. In sum, it appears that examining ΔAFIs may be a valuable tool for establishing MI. These indices could supplement or replace the LRT for some data conditions. However, further study is needed before widespread implementation should proceed.

REFERENCES

Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-322.
Arbuckle, J. L., & Wothke, W. (1999). Amos 4.0 user's guide. Chicago: SmallWaters.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238-246.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.
Bollen, K. A. (1986). Sample size and Bentler and Bonett's nonnormed fit index. Psychometrika, 51, 375-377.
Bollen, K. A. (1989). A new incremental fit index for general structural equation models. Sociological Methods & Research, 17, 303-316.
Brannick, M. T. (1995). Critical comments on applying covariance structure modeling. Journal of Organizational Behavior, 16(3), 201-213.


Browne, M. W., & Cudeck, R. (1989). Single sample cross-validation indices for covariance structures. Multivariate Behavioral Research, 24, 445-455.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.
Chan, D. (1998). The conceptualization and analysis of change over time: An integrative approach incorporating longitudinal mean and covariance structures analysis (LMACS) and multiple indicator latent growth modeling (MLGM). Organizational Research Methods, 1(4), 421-483.
Chan, D., & Schmitt, N. (1997). Video-based versus paper-and-pencil method of assessment in situational judgment tests: Subgroup differences in test performance and face validity perceptions. Journal of Applied Psychology, 82(1), 143-159.
Chan, D., & Schmitt, N. (2000). Interindividual differences in intraindividual changes in proactivity during organizational entry: A latent growth modeling approach to understanding newcomer adaptation. Journal of Applied Psychology, 85(2), 190-210.
Cheung, G. W., & Rensvold, R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9(2), 233-255.
Cudeck, R., & Browne, M. W. (1983). Cross-validation of covariance structures. Multivariate Behavioral Research, 18(2), 147-168.
Drasgow, F. (1984). Scrutinizing psychological tests: Measurement equivalence and equivalent relations with external variables are the central issues. Psychological Bulletin, 95(1), 134-135.
Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86(2), 215-227.
Hoelter, J. W. (1983). The analysis of covariance structures: Goodness-of-fit indices. Sociological Methods & Research, 11, 325-344.
Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.
Horn, J. L., & McArdle, J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18(3-4), 117-144.

James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models, and data. Beverly Hills, CA: Sage.
Jöreskog, K., & Sörbom, D. (1996). LISREL 8: User's reference guide. Chicago: Scientific Software International.
Kelloway, E. K. (1995). Structural equation modeling in perspective. Journal of Organizational Behavior, 16(3), 215-224.
Marsh, H. W. (1985). The structure of masculinity/femininity: An application of confirmatory factor analysis to higher-order factor structures and factorial invariance. Multivariate Behavioral Research, 20(4), 427-449.
Marsh, H. W. (1987). The factorial invariance of responses by males and females to a multidimensional self-concept instrument: Substantive and methodological issues. Multivariate Behavioral Research, 22(4), 457-480.
Marsh, H. W., & Hocevar, D. (1985). Application of confirmatory factor analysis to the study of self-concept: First- and higher order factor models and their invariance across groups. Psychological Bulletin, 97(3), 562-582.
Marsh, H. W., Balla, J. R., & McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: The effect of sample size. Psychological Bulletin, 103(3), 391-410.
McDonald, R. P. (1989). An index of goodness-of-fit based on noncentrality. Journal of Classification, 6, 97-103.
McDonald, R. P., & Marsh, H. W. (1990). Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 107, 247-255.
Meade, A. W., & Lautenschlager, G. J. (2004). A Monte-Carlo study of confirmatory factor analytic tests of measurement equivalence/invariance. Structural Equation Modeling, 11(1), 60-72.
Meredith, W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58(4), 525-543.
Meredith, W., & Millsap, R. E. (1992). On the misuse of manifest variables in the detection of measurement bias. Psychometrika, 57(2), 289-311.
Millsap, R. E. (1995). Measurement invariance, predictive invariance, and the duality paradox. Multivariate Behavioral Research, 30(4), 577-605.
Ployhart, R. E., Weekley, J. A., Holtz, B. C., & Kemp, C. (2003). Web-based and paper-and-pencil testing of applicants in a proctored setting: Are personality, biodata and situational judgment tests comparable? Personnel Psychology, 56(3), 733-752.
Riordan, C. M., & Vandenberg, R. J. (1994). A central question in cross-cultural research: Do employees of different cultures interpret work-related measures in an equivalent manner? Journal of Management, 20(3), 643-671.
Steiger, J. H. (1989). EzPATH: Causal modeling. Evanston, IL: SYSTAT.
Steiger, J. H., Shapiro, A., & Browne, M. W. (1985). On the multivariate asymptotic distribution of sequential chi-square statistics. Psychometrika, 50, 253-263.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.
Vandenberg, R. J. (2002). Toward a further understanding of and improvement in measurement invariance methods and procedures. Organizational Research Methods, 5(2), 139-158.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4-69.
Wheaton, B., Muthen, B., Alwin, D. F., & Summers, G. F. (1977). Assessing reliability and stability in panel models. In D. R. Heise (Ed.), Sociological methodology (pp. 84-136). San Francisco: Jossey-Bass.

Author Contact Info: Adam W. Meade Department of Psychology North Carolina State University Campus Box 7650 Raleigh, NC 27695-7650 Phone: 919-513-4857 Fax: 919-515-1716 E-mail: [email protected]

TABLE 1
Population Factor Loadings for Group 1 and Group 2 Data

                Group 1                 Group 2
Item    Factor 1   Factor 2     Factor 1   Factor 2
 1        .80         -           .80         -
 2        .70         -           .70         -
 3        .60         -           .60         -
 4        .50         -           .50         -
 5        .80         -           XX          -
 6        .70         -           XX          -
 7        .60         -           .60         -
 8        .50         -           .50         -
 9         -         .80           -         .80
10         -         .70           -         .70
11         -         .60           -         XX
12         -         .50           -         XX
13         -         .80           -         .80
14         -         .70           -         .70
15         -         .60           -         .60
16         -         .50           -         .50

Note: XX indicates a DF item with variable magnitude of DF. Numeric Group 2 loadings are equal to their Group 1 counterparts (i.e., these are not DF items).

TABLE 2
Omega-Squared Effect Size Estimates for the Amount of DF and Sample Size on ΔAFI Indices, Sorted by the Effect of Amount of DF

ΔAFI              Amount of DF (DF)   Sample Size (N)   DF*N
Gamma-hat               0.824              0.007        0.000
McDonald's NCI          0.824              0.007        0.000
IFI                     0.812              0.005        0.000
RNI                     0.811              0.006        0.000
CFI                     0.722              0.002        0.002
RMSEA                   0.651              0.001        0.022
χ²                      0.588              0.010        0.130
NCP                     0.588              0.010        0.130
Critical N              0.389              0.013        0.198

TABLE 3
Omega-Squared Effect Size Estimates for the Amount of DF and Sample Size on ΔAFI Indices, Sorted by the Effects of Sample Size

ΔAFI              Amount of DF (DF)   Sample Size (N)   DF*N
CFI                     0.722              0.002        0.002
IFI                     0.812              0.005        0.000
RNI                     0.811              0.006        0.000
McDonald's NCI          0.824              0.007        0.000
Gamma-hat               0.824              0.007        0.000
RMSEA                   0.651              0.001        0.022
χ²                      0.588              0.010        0.130
NCP                     0.588              0.010        0.130
Critical N              0.389              0.013        0.198

Note: Table sorted by the sum of the effects of N and DF*N.

TABLE 4
Correlations Between ΔAFIs

                  χ²     CFI   Crit. N  G-hat   IFI   McD NCI  RMSEA   RNI    NCP
χ²               1.00
CFI              0.83   1.00
Critical N       0.94   0.64   1.00
Gamma-hat        0.86   0.94   0.70    1.00
IFI              0.85   0.96   0.68    0.99   1.00
McDonald's NCI   0.87   0.93   0.71    1.00   0.99   1.00
RMSEA            0.87   0.88   0.78    0.89   0.87   0.89    1.00
RNI              0.85   0.96   0.68    0.99   1.00   0.99    0.87   1.00
NCP              1.00   0.83   0.94    0.86   0.85   0.87    0.87   0.85   1.00

FIGURE 1
Changes in AFIs by Level of DF and Sample Size

[Nine panels of line graphs, one for each index: change in chi-square, Gamma-hat, McDonald's NCI, IFI, RNI, CFI, RMSEA, NCP, and Critical N. Each panel plots the change in the fit statistic (y-axis) against the amount of DF simulated (x-axis), with separate lines for sample sizes of 100, 200, 300, 400, and 500.]

FIGURE 2
Percentage of Significant Analyses by Level of DF and Sample Size

[Five panels of line graphs, one for each sample size condition (N = 100, 200, 300, 400, and 500). Each panel plots the percentage of significant analyses (y-axis) against the amount of DF simulated (x-axis), with separate lines for the chi-square LRT, Gamma-hat, McDonald's NCI, and CFI.]

FIGURE 3
Percentage of Significant Analyses by Level of DF and Sample Size

[Four panels of line graphs, one for each criterion: change in chi-square, change in Gamma-hat, change in McDonald's NCI, and change in CFI. Each panel plots the percentage of significant analyses (y-axis) against the amount of DF simulated (x-axis), with separate lines for sample sizes of 100, 200, 300, 400, and 500.]
