Braddy, P. W., Meade, A. W., & Johnson, E. C. (2006, April). Practical Implications of Using Different Tests of Measurement Invariance for Polytomous Measures. Paper presented at the 21st Annual Conference of the Society for Industrial and Organizational Psychology, Dallas, TX.
Practical Implications of Using Different Tests of Measurement Invariance for Polytomous Measures Phillip W. Braddy, Adam W. Meade, Emily C. Johnson North Carolina State University
Using male/female and Caucasian/African American comparison groups, this study examined the practical ramifications of using two IRT-based analytic methods, DFIT and the Likelihood Ratio Test (LRT), to assess the measurement invariance of a 21-item leadership development scale under ten sample size conditions (e.g., 200, 500, & 1000). In nine of ten conditions, the LRT identified multiple items that exhibited DIF, whereas DFIT detected only a single item with DIF in one set of analyses. Conclusions based on the LRT indicated a lack of measurement invariance for the scale, while DFIT implied near-perfect measurement invariance. Thus, these findings highlight the impact of the choice of analytic method on the determination of measurement invariance in applied samples.
Likert scales have been widely used in psychological research to compare different groups of respondents on various constructs and to assess the intra-individual development of respondents on constructs over time via longitudinal measurement. Although this has traditionally been viewed as a sound practice so long as the scale used was reliable and construct valid, researchers have recently acknowledged that making meaningful comparisons on a construct across time and/or groups also requires demonstrating identical psychometric properties of the scale used. Specifically, it must be demonstrated that scale items do not exhibit differential functioning (DF) across groups or time periods. A lack of DF indicates a measure is invariant in that all items function the same way for all respondents. In other words, individuals with equivalent levels of the latent construct exhibit comparable observed responses to these items across different test administrations and/or regardless of their group memberships (Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991; Potenza & Dorans, 1995). Although confirmatory factor analytic (see Vandenberg & Lance, 2000) and item response theory (IRT) methods such as Lord’s Chi-Square (Lord, 1980) and Raju’s (1988) unsigned area measure have traditionally been used to assess DF in Likert measures (Collins, Raju, & Edwards, 2000), there has recently been growing research interest in the use of two more recent IRT-based procedures: the Likelihood Ratio Test (LRT; Thissen, Steinberg, & Wainer, 1988) and the Differential Functioning of
Items and Tests (DFIT; Raju, van der Linden, & Fleer, 1995) framework. Recent research comparing these two IRT methods via Monte Carlo simulation has demonstrated that DFIT is less sensitive to differential item functioning (DIF) than is the LRT (e.g., Bolt, 2002; Meade & Lautenschlager, 2004); however, the conditions examined in these simulation studies were unrealistic. Moreover, to date, no published study has applied both analyses to organizational (i.e., non-simulated) data to illustrate how the conclusions drawn about the measurement equivalence of a Likert scale differ depending upon the method of analysis chosen. As such, in the present study, we conducted DIF analyses using both the LRT and DFIT on a 21-item Likert measure obtained from the Center for Creative Leadership (CCL) under different conditions of sample size to illustrate the effects of the choice of methodology on invariance conclusions.

Overview of the LRT

The LRT provides item-level indices of differential item functioning (DIF) for two groups by testing two types of models: compact and augmented. The compact model estimates item parameters under the constraint that they are equal for the two groups (e.g., males vs. females). Next, a series of augmented models is estimated in which all item parameters are constrained to be equal between groups except those of a focal item. The fit of the compact model and of each augmented model is compared by computing the likelihood ratio statistic:
G² = −2 ln[L*(Model C) / L*(Model A)],   (1)

where L* represents the likelihood function and Models C and A correspond to the compact and augmented models, respectively. The G² statistic is asymptotically distributed approximately as χ², and a significant G² statistic indicates that allowing an item's parameters to be freely estimated across groups in the augmented model significantly improves its fit over that of the compact model. Therefore, a significant G² value indicates DIF with respect to one or more of the item's parameters (e.g., the a and b parameters; see Bolt, 2002; Camilli & Shepard, 1994; Thissen et al., 1988, for a more technical treatment).
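As a minimal illustration of Equation 1 (this is a sketch, not the IRTLRDIF implementation, and the log-likelihood values are hypothetical), the comparison of a compact and an augmented model reduces to the following computation:

```python
# Sketch of the likelihood ratio test in Equation 1: given maximized
# log-likelihoods for the compact model (item parameters constrained
# equal across groups) and an augmented model (the focal item's
# parameters freed), G^2 = -2[ln L*(C) - ln L*(A)].
from scipy.stats import chi2

def lrt_dif(loglik_compact, loglik_augmented, df):
    """Return (G2, p) for a nested model comparison freeing df parameters."""
    g2 = -2.0 * (loglik_compact - loglik_augmented)
    p = chi2.sf(g2, df)  # upper-tail probability of the chi-square reference
    return g2, p

# Hypothetical example: freeing one slope (a) and four thresholds (b1-b4)
# of a 5-category graded response model item gives df = 5.
g2, p = lrt_dif(loglik_compact=-10234.6, loglik_augmented=-10227.4, df=5)
print(f"G2 = {g2:.2f}, p = {p:.4f}")  # G2 = 14.40; flagged as DIF at alpha = .05
```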
Overview of the DFIT Framework

DFIT analyses provide item-level indices and a test-level index of DF. Thus, unlike the LRT, which provides assessments of DF at only the item level, the DFIT indices allow one to determine whether individual items exhibit DIF and/or whether DF exists at the test level (i.e., DTF). The concept of DTF is predicated on the notion that item-level DIF can sometimes exist but cancel across items such that the expected test or scale scores for persons with the same ability levels do not differ. If DTF does not exist, one can often retain all items using this method, even those with DIF, if one is interested in using only scale scores, as is traditionally done in psychological research (see Raju et al., 1995, for a review).

There are two item-level DIF indices in the DFIT framework: a signed compensatory measure (CDIF) and an unsigned non-compensatory measure (NCDIF). CDIF takes on positive values if items favor the referent group (i.e., if equal levels of the latent trait, theta, are associated with a greater probability of endorsing high response options) and negative values if items favor the focal group. The CDIF index is not used to establish the DF of items; instead, it is used to identify for removal those items that likely contribute to significant DTF (Collins et al., 2000; Raju et al., 1995). The second item-level index provided by DFIT is NCDIF. Unlike CDIF, this index takes on only positive values, and NCDIF values are computed independently of other test items. Using this index with 5-point Likert scales, items are presumed to exhibit DIF when their NCDIF values exceed a cutoff criterion of .096 (see Flowers, Oshima, & Raju, 1999; Raju, 2000; Raju et al., 1995, for a more technical review).

Finally, the test-level index of differential functioning is the DTF index. DTF assesses whether differential functioning occurs at the scale level and is mathematically equal to the sum of the item CDIF values (Collins et al., 2000; Raju et al., 1995). Like NCDIF, parametric χ² significance tests of differential functioning exist for the DTF index, but previous studies have indicated that these tests should not be used because they are overly sensitive to low levels of differential functioning (see Raju et al., 1995). Instead, DTF is said to exist when the DTF value exceeds its cutoff criterion, which, for 5-point Likert scales, is calculated as the NCDIF cutoff (.096) times the number of items comprising the scale. When DTF is found, scale scores should not be used until the items causing the scale to function differentially for the groups are eliminated using the CDIF indices, as previously noted.

Research Comparing the LRT and DFIT in their Sensitivity to DIF

Because both the LRT (e.g., Kim, 2001; Lee, 1996) and DFIT (e.g., Collins et al., 2000; Donovan, Drasgow, & Probst, 2000; Facteau & Craig, 2001) have been used in previous empirical studies and in practice to detect DIF in polytomous items, researchers have recently begun to compare the relative sensitivity of these two methods using Monte Carlo studies. In the first such study, Meade and Lautenschlager (2004) compared the efficacy of these approaches for detecting DIF in simulated data under three conditions of sample size (150, 500, & 1,000). Their results revealed that the LRT was much more sensitive in identifying existing DIF than was DFIT in all conditions, with the greatest differences in sensitivity occurring in the sample size condition of 1,000. Similarly, Bolt (2002) compared the performance of the LRT and DFIT on simulated data in small and large samples. Results from this study also indicated that the LRT was more sensitive to DIF than DFIT in both conditions, although Bolt noted that the Type I error rate was also slightly higher for the LRT.

The greater sensitivity of the LRT is not surprising, particularly with large samples. The LRT utilizes an (asymptotic) chi-square distribution to determine differences in item parameters. As a parametric statistic, chi-square shows increased power at larger sample sizes. Conversely, the NCDIF and DTF statistics yielded by the DFIT method are purposely not parametric statistics. Early investigations into the performance of NCDIF (and hence DTF) indicated that the χ² values associated with these indices were overly sensitive to DIF (see Raju et al., 1995); thus, Raju incorporated a cutoff value for the NCDIF index. This cutoff value was intended to flag DIF when it was practically, not merely statistically, significant. By focusing on practical significance, Raju and colleagues intentionally increased the amount of DIF needed in an item before the item should be considered to exhibit DF.
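To make the DFIT decision rules described above concrete, the following is a minimal sketch of how the polytomous DFIT indices are commonly defined (Raju et al., 1995) for a 5-point graded response model. It is not the DFITPS6 program; the item parameters and theta values are hypothetical and are assumed to be already on a common metric:

```python
# Sketch of polytomous DFIT indices for a 5-point graded response model
# (GRM). d_i(theta) is the focal-minus-referent difference in an item's
# expected score; NCDIF_i = E[d_i^2], CDIF_i = E[d_i * D], and
# DTF = E[D^2] = sum of CDIF, where D = sum of the d_i.
import numpy as np

def grm_expected_score(theta, a, b):
    """Expected item score (1-5) under the GRM with slope a, thresholds b1-b4."""
    theta = np.asarray(theta, dtype=float)[:, None]            # (n persons, 1)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))  # boundary curves
    probs = np.hstack([1 - p_star[:, :1],
                       p_star[:, :-1] - p_star[:, 1:],
                       p_star[:, -1:]])                        # category probabilities
    return probs @ np.arange(1, 6)                             # expected 1..5 score

def dfit_indices(theta_focal, params_ref, params_foc):
    """Return (NCDIF per item, CDIF per item, DTF), evaluated at focal thetas."""
    d = np.column_stack([grm_expected_score(theta_focal, af, bf) -
                         grm_expected_score(theta_focal, ar, br)
                         for (ar, br), (af, bf) in zip(params_ref, params_foc)])
    D = d.sum(axis=1)                        # test-level expected-score difference
    ncdif = (d ** 2).mean(axis=0)            # E[d_i^2], always non-negative
    cdif = (d * D[:, None]).mean(axis=0)     # E[d_i * D]; these sum to DTF
    return ncdif, cdif, float((D ** 2).mean())

# Hypothetical two-item example: (a, [b1..b4]) per item and group.
ref = [(1.8, [-2.1, -1.2, 0.0, 1.5]), (1.2, [-2.3, -1.5, -0.1, 1.5])]
foc = [(1.4, [-2.0, -1.1, 0.5, 1.3]), (1.2, [-2.8, -1.5, -0.3, 1.2])]
theta = np.random.default_rng(0).normal(size=1000)  # focal-group thetas

ncdif, cdif, dtf = dfit_indices(theta, ref, foc)
print("NCDIF:", ncdif.round(3), "flag items if > .096")
print("DTF:", round(dtf, 3), "flag scale if >", round(0.096 * len(ref), 3))
```

Note how the decision rules are fixed effect-size cutoffs (.096 per item; .096 times the number of items for the scale) rather than significance tests, which is the design choice discussed above.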
Importantly, though the aforementioned Monte Carlo studies have demonstrated that the LRT is more sensitive to DIF than is DFIT, the external validity of simulation studies is always a concern. For example, the Bolt (2002) study simulated 30 non-DIF items and a single DIF item using item parameters derived from IRT models seldom used in practice with Likert scale data. Similarly, the Meade and Lautenschlager (2004) study simulated a very narrow set of conditions that may seldom be seen in Likert-type test data (e.g., perfect normality of simulated variables and evenly spaced item threshold parameters). In practice, applied researchers seldom have access to data as idealized as those used in these two studies. Problems with kurtosis and skewness are commonplace in applications of DIF analysis, to the extent that researchers must often collapse different response options (e.g., 1 and 2) into a single category before item parameters can be estimated. Thus, by using actual data typical of that encountered in practice, this study ensures realistic data properties and hence greater external validity. Additionally, while simulation studies are extremely useful for indicating the efficacy of tests to detect effects, they are poorly suited to highlighting the implications that arise when interpreting real test results. Because no known published study has compared these two methods using non-simulated data, the potentially sizable effect the choice of method may have on testing implications may not be fully realized by practitioners in the testing industry. In this study, we evaluate the DF of a measure used in practice for leadership development to illustrate how differences in sensitivity to DIF may translate into conclusions drawn about the measurement invariance of Likert scales.

Hypothesis 1: LRT and DFIT results will lead to different measurement invariance conclusions about a 21-item Likert scale. Specifically, we predict the LRT will identify more items as exhibiting DF than will DFIT.

Hypothesis 2: The differences in the results of the LRT and DFIT will be much more pronounced in our larger samples (1000, 1500, & 1800). Specifically, the LRT will be more sensitive at larger sample sizes than at smaller sample sizes, while DFIT will show no difference in sensitivity across sample sizes.
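The rationale behind Hypothesis 2 can be sketched numerically. Under DIF, G² is (asymptotically) noncentral chi-square with noncentrality growing in proportion to sample size, so the LRT's power rises with N while DFIT's fixed NCDIF cutoff does not. The per-respondent noncentrality of .01 below is an arbitrary, hypothetical DIF magnitude chosen purely for illustration:

```python
# Power of a chi-square test (df = 5, alpha = .05) as sample size grows,
# assuming a fixed, hypothetical per-respondent noncentrality of .01.
from scipy.stats import chi2, ncx2

df, alpha, per_n = 5, .05, .01
crit = chi2.ppf(1 - alpha, df)
for n in (200, 300, 500, 1000, 1500, 1800):
    print(f"N = {n:4d}: power = {ncx2.sf(crit, df, per_n * n):.2f}")
# DFIT's NCDIF cutoff (.096) is a fixed effect-size criterion and does
# not become more sensitive as N increases.
```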
Method
Participants and Measures

A data set of 5,396 employees from diverse organizations who were undergoing leadership development efforts at CCL comprised this study's total sample. All study participants were assessed on a 21-item Likert measure (α = .93) taken from the Prospector® Instrument (CCL, 1994). Ratings of participants were provided by the participants themselves or by others (e.g., peers & subordinates) using a 5-point Likert rating scale (1 = strongly disagree; 5 = strongly agree). Approximately 18.7% of the participants assessed with this measure were top executives, 8.9% upper-middle managers, 25.1% middle managers, 33.4% lower-level managers, and 13.9% nonmanagement employees. Sixty-five percent of the sample was male, and the ethnic composition of participants was 68.2% Caucasian, 16.1% African American, 8.3% Asian, 3.4% Hispanic, and 1.3% Native American, while 2.7% indicated "other."

Design
Random subsamples of males/females and Caucasians/African Americans were drawn from the full sample of 5,396 respondents in order to evaluate the relative sensitivity of the LRT and DFIT at different sample sizes. Specifically, we created separate data sets for six sample size conditions (200, 300, 500, 1000, 1500, & 1800) for the male/female comparisons and four sample size conditions (200, 300, 500, & 800) for the Caucasian/African American comparisons.

Data Analysis

Exploratory Factor Analysis (EFA). A principal axis EFA was employed to ascertain whether the study's 21-item measure conformed to the unidimensionality assumption necessary for IRT analyses. As recommended by Hatcher (1994), we relied on scree plots, simple structure, and interpretability as the main criteria for selecting the best factor solution for our survey. We also considered Reckase's (1979) assertion that it is appropriate to assume unidimensionality if (a) a dominant first factor accounts for at least 20% of the variance in the survey items and (b) the factor with the second largest eigenvalue explains only a small amount of the remaining variance; a check along these lines is sketched below.
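As a rough illustration of this screening criterion (a principal components approximation to the eigenvalue check, not the principal axis EFA itself, and with simulated ratings standing in for the actual 21-item response matrix):

```python
# Unidimensionality screen in the spirit of Reckase (1979): inspect the
# eigenvalues of the inter-item correlation matrix and the proportion of
# variance each accounts for. The data here are simulated; in practice
# `responses` would be the n-respondents x 21-items matrix of 1-5 ratings.
import numpy as np

rng = np.random.default_rng(1)
general = rng.normal(size=(500, 1))  # one dominant latent factor
responses = np.clip(np.rint(3 + general + rng.normal(scale=0.8, size=(500, 21))), 1, 5)

eigvals = np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False))[::-1]
proportion = eigvals / eigvals.sum()

print(f"First factor:  {proportion[0]:.1%} of variance (want >= 20%)")
print(f"Second factor: {proportion[1]:.1%} (want small relative to the first)")
```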
LRT. LRT analyses were conducted separately for the gender and race comparison groups under each of the study's sample size conditions using the IRTLRDIF program (Thissen, 2001). This program simultaneously estimates item threshold (b) and slope (a) parameters for the referent and focal groups using the graded response model and yields G² statistics (described earlier) for each item. The G² statistics are used to detect DIF due to the a and/or b parameters.

DFIT. DFIT analyses were conducted separately for the gender and race comparison groups under each of the study's sample size conditions following the procedure outlined by Raju et al. (1995). Specifically, item and person parameters were initially estimated for both the referent and focal groups via the graded response model using Multilog 7.03 (Thissen, 1991). Item and person parameters were then linked using the Equate 2.1 program (Baker, 1995), in which the focal groups' parameters were transformed to the metric of the referent groups using Stocking and Lord's (1983) test characteristic curve method. Finally, DFIT statistics (NCDIF, CDIF, & DTF) were obtained using the DFITPS6 program (Raju, 2000). Following the guidelines of Raju (2000), the DTF values for the male/female and Caucasian/African American comparisons were compared to their cutoff criterion of 2.016. The DFIT program calculates this criterion as the number of scale items (21) times the recommended NCDIF item cutoff value of .096 for 5-point Likert rating scales. Sensitivity to DF for both the LRT and the NCDIF index was measured as the number of items detected as exhibiting DIF.
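For intuition about the linking step, the following rough sketch (reusing grm_expected_score and the hypothetical ref/foc parameters from the DFIT sketch above; it is not the Equate 2.1 program) finds the slope A and intercept B of Stocking and Lord's (1983) transformation by minimizing the squared distance between the two groups' test characteristic curves:

```python
# Sketch of Stocking-Lord test characteristic curve (TCC) linking for the
# GRM: theta* = A*theta + B implies a* = a / A and b* = A*b + B for each
# item. Assumes grm_expected_score, ref, and foc from the earlier sketch.
import numpy as np
from scipy.optimize import minimize

def tcc(theta, params):
    """Test characteristic curve: sum of GRM expected item scores."""
    return sum(grm_expected_score(theta, a, b) for a, b in params)

def stocking_lord(params_ref, params_foc, theta=np.linspace(-4, 4, 41)):
    target = tcc(theta, params_ref)  # referent-group TCC at quadrature points
    def loss(x):
        A, B = x
        moved = [(a / A, A * np.asarray(b) + B) for a, b in params_foc]
        return float(np.mean((target - tcc(theta, moved)) ** 2))
    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x

A, B = stocking_lord(ref, foc)
print(f"Linking constants: A = {A:.3f}, B = {B:.3f}")
```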
Results

Unidimensionality

Results of an EFA indicated that the factor with the highest eigenvalue accounted for 71.44% of the variance present and that all 21 items (see Table 1) had loadings on this factor above .40. Conversely, the two-factor solution explained only an additional 6.58% of the variance in the item responses, and simple structure in the rotated solution was not attained due to numerous cross-factor loadings. Therefore, we concluded that the unidimensionality assumption was met. Note that EFAs were conducted separately for the male, female, Caucasian, and African American samples, with results nearly identical to those for the entire sample. The items comprising this factor shared a dominant theme that can best be characterized as "willingness to learn from experiences."

LRT Findings

In the male/female comparisons, the LRTs identified multiple items that exhibited a and/or b parameter DIF in each of the six sample size conditions. Specifically, three items were detected to exhibit DIF in the sample size condition with 200 respondents, moderate numbers of items (6, 4, & 6, respectively) displayed DIF in the sample sizes of
300, 500, and 1000 respondents, and the greatest numbers of DIF items (8 & 14, respectively) were found in the sample size conditions with 1500 and 1800 participants. See Tables 2 and 3 for a complete list of these items and their respective magnitudes of DIF. The LRT also detected items that exhibited a and/or b parameter DIF in the Caucasian/African American comparisons in each of their four sample size conditions. While only one item was identified as having DIF in the sample size condition of 200 respondents, four, five, and seven items, respectively, were detected as having DIF in the sample size conditions consisting of 300, 500, and 800 respondents. See Tables 4 and 5 for a complete list of these items and their respective magnitudes of DIF.

DFIT Findings

DTF values for the male/female comparisons in the six sample size conditions (200, 300, 500, 1000, 1500, & 1800) were .04, .10, .10, .03, .02, and .04, respectively. DTF values for the Caucasian/African American comparisons in their four sample size conditions (200, 300, 500, & 800) were .01, .04, .12, and .05, respectively. Because each of these obtained DTF values was well below the cutoff criterion of 2.016, we concluded that no DTF existed for any of our comparisons. Moreover, item-level NCDIF indices indicated that each individual item was free of DIF (i.e., all had NCDIF values less than .096) in the male/female comparisons as well; however, in the Caucasian/African American comparisons, Item 19 was found to have DIF (NCDIF value = .12) in the sample size condition with 300 respondents. Thus, no DF at the scale level was found, and only one item was identified as having DIF by the DFIT analyses.

Discussion

Although recent Monte Carlo studies (e.g., Bolt, 2002; Meade & Lautenschlager, 2004) have demonstrated that the LRT is more sensitive to DF than is DFIT, this prior work is limited in two respects. First, the data conditions simulated in those studies were somewhat unrealistic. Second, by their very nature, simulation studies have difficulty in conveying how these differential sensitivities to DF translate into practical differences in the conclusions drawn about measurement equivalence. The present study revealed two important findings. First, congruent with the results of Bolt (2002) and Meade and Lautenschlager (2004), the LRT identified many items with a and/or b parameter DIF in both sets of comparisons, whereas DFIT detected only a single item displaying DIF in the
Caucasian/African American comparison. Second, the LRT identified many more items with DIF in the larger samples (1000, 1500, & 1800) than in the smaller samples (200, 300, 500, & 800). As the LRT is (asymptotically) distributed as chi-square, its greater sensitivity at larger sample sizes was not surprising.

The primary implication of the study's findings is that the type of DIF analysis employed can lead to very different conclusions about the invariance of a measure used across samples. Our LRT results led us to conclude that our scale should not be used to compare racial or gender groups due to widespread DIF. Conversely, our DFIT results indicated that we had near-perfect measurement invariance (thus Hypothesis 1 was supported). While the differences in the measurement invariance conclusions drawn about our scale were fairly large in all sample size conditions, these differences were particularly pronounced in the larger samples (e.g., 1000, 1500, & 1800), supporting Hypothesis 2. From a practical standpoint, if following the guidelines of Thissen, Steinberg, and Wainer (1993), we would have discarded up to 14 of the 21 scale items on the basis of the LRT results to prevent potential test bias; by contrast, we could have retained all of our items for future use when interpreting the DFIT results. If we revised our scale based on the LRT results, our measure would be strictly invariant, but it would also likely exhibit substantially lower reliability given the fewer items retained; the reverse would be true if we relied on the DFIT results. This large discrepancy between the LRT and DFIT findings may primarily be attributed to the fact that the two approaches rely on different decision rules for detecting DIF: DFIT uses fixed cutoff values, whereas the LRT relies on statistical significance testing, which is much more sensitive to sample size.

Compared to the previous simulation studies (Bolt, 2002; Meade & Lautenschlager, 2004) that examined the relative sensitivity to DIF of DFIT and the LRT, the current study has two notable advantages. First, whereas simulation studies compare these methods under idealized conditions (e.g., perfect multivariate normality, evenly spaced item threshold parameters, rare presence of DIF), we applied these analyses to a non-simulated data set that included variables that were not perfectly normally distributed (see Table 6); thus, one can place more confidence in the external validity of this study's findings, because most practitioners using these methods would likewise be analyzing data that depart from multivariate normality. A second advantage pertaining to the external validity of these findings is that our study compared DFIT and the LRT under a greater range of sample size
conditions, illustrating the differences in these methods' results that may occur at a variety of sample sizes likely to be encountered by practitioners and researchers. As such, it provides data analysts with more information for making informed decisions about the approximate differences in measurement invariance conclusions that would be reached when using the two analyses with various sample sizes.

Limitations and Future Research

The findings of the present study should be viewed in the context of four limitations. First, though these results suggest the LRT is much more sensitive to DF than is DFIT, it should be noted that the true magnitude of population DIF in this study's 21-item measure was not known. In this respect, simulation studies have the advantage that the effect size of DIF is known by design; however, simulation studies always struggle with issues of external validity with respect to the data generation conditions, a problem not inherent in this study. We caution readers not to draw precise conclusions regarding the extent to which DF may have been under-estimated by DFIT and/or over-estimated by the LRTs. This study's findings should be interpreted only as an illustration of the practical differences in the decisions that would be made about the measurement invariance of our Likert scale when relying on these two IRT methods.

A second limitation of the present study was its limited scope. Namely, due to the limitations of our archival data set, we were not able to make DIF comparisons for Caucasians/African Americans under large sample size conditions (e.g., 1000, 1500, & 1800). We also used only the LRT and DFIT to assess the measurement equivalence of our Likert scale. With regard to the latter limitation, future research should examine the practical differences of using the LRT, DFIT, and more traditional tests (e.g., confirmatory factor analysis) when evaluating the measurement invariance of Likert scales.

A third limitation of this study was that we used only one random sample per sample size condition to conduct our DIF analyses. Though our results illustrate more discrepant results between the LRT and DFIT in the larger samples as compared to the smaller samples, they did not uniformly show the linear relationship between sample size and the number of items detected as having DIF that likely would have occurred if multiple random samples per sample size condition had been used in all DIF analyses. A final limitation was that our samples were formed by combining ratings from multiple rater
sources (e.g., peers and subordinates); however, we do not view this as a serious concern for two reasons. First, previous research has demonstrated the psychometric equivalence of multi-source data (e.g., Facteau & Craig, 2001; Maurer, Raju, & Collins, 1998). Second, the primary focus of this study was to illustrate the differences in conclusions drawn about the measurement invariance of Likert scales when using the LRT and DFIT, not to interpret the underlying causes of DIF.

Conclusions

The results of the present study illustrate the considerable difference in sensitivity to DF between the LRT and DFIT methodologies. These results also illustrate the practical differences that would exist in the conclusions drawn about the measurement invariance of Likert scales when applying these two methods of analysis to real (i.e., non-simulated) data. Specifically, the DFIT results suggested this study's scale exhibited almost perfect measurement invariance, whereas the LRT implied that the scale lacked measurement equivalence. These findings suggest that practitioners and researchers should choose tests of measurement invariance for their scales carefully, especially in large samples. If researchers need a more conservative (i.e., stricter) assessment of measurement invariance, the LRT is perhaps the better method of analysis. A stricter assessment of DF could be warranted, for example, when dealing with tests used for selection or other "high-stakes" testing situations. In such situations, strict invariance of measures would be desirable and potentially necessary in order to meet legal requirements for tests. In other situations, a more liberal test of DF may be preferable, in which case DFIT may be the more suitable choice. For instance, researchers primarily interested in correlations between variables (for example, job satisfaction and organizational culture) may need some assurance that the measures are generally functioning equivalently across groups, but strict invariance at all levels of the latent trait may not be necessary. Given the wide discrepancy between the conclusions reached with these two parallel methodologies, we encourage future research to address the situations under which one approach may be preferable to the other.

References

Baker, F. B. (1995). EQUATE 2.1: Computer program for equating two metrics in item response theory [Computer program]. Madison: University of Wisconsin, Laboratory of Experimental Design.
Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15, 123-141.

Camilli, G., & Shepard, L. A. (1994). Methods for identifying biased test items. Thousand Oaks, CA: Sage.

Center for Creative Leadership. (1994). Prospector®. Greensboro, NC: Author.

Collins, W. C., Raju, N. S., & Edwards, J. E. (2000). Assessing differential functioning in a satisfaction scale. Journal of Applied Psychology, 85, 451-461.

Donovan, M. A., Drasgow, F., & Probst, T. M. (2000). Does computerizing paper-and-pencil job attitude scales make a difference? New IRT analyses offer insight. Journal of Applied Psychology, 85, 305-313.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.

Facteau, J. D., & Craig, S. B. (2001). Are performance appraisal ratings from different rating sources comparable? Journal of Applied Psychology, 86, 215-227.

Flowers, C. P., Oshima, T. C., & Raju, N. S. (1999). A description and demonstration of the polytomous-DFIT framework. Applied Psychological Measurement, 23, 309-326.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Hatcher, L. (1994). A step-by-step approach to using SAS for factor analysis and structural equation modeling. Cary, NC: SAS Institute.

Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18, 89-115.

Lee, K. O. (1996). Application of the graded response model to the revised Tennessee Self-Concept Scale: Unidimensionality, parameter invariance, and differential item functioning. Dissertation Abstracts International Section A: Humanities and Social Sciences, 57(4-A), 1584.

Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Maurer, T. J., Raju, N. S., & Collins, W. C. (1998). Peer and subordinate performance appraisal measurement equivalence. Journal of Applied Psychology, 83, 693-702.

Meade, A. W., & Lautenschlager, G. J. (2004, April). Same question, different answers: CFA and two IRT approaches to measurement
invariance. Paper presented at the 19th Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, IL.

Potenza, M. T., & Dorans, N. J. (1995). DIF assessment for polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19, 23-37.

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.

Raju, N. S. (2000). Notes accompanying the differential functioning of items and tests (DFIT) computer program.

Raju, N. S., van der Linden, W. J., & Fleer, P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368.

Reckase, M. D. (1979). Unifactor latent trait models applied to multi-factor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Thissen, D. (1991). MULTILOG users' guide: Multiple categorical item analysis and test scoring using item response theory [Computer software]. Chicago: Scientific Software International.

Thissen, D. (2001). IRTLRDIF v.2.02b: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning [Computer software]. Chapel Hill, NC: L. L. Thurstone Psychometric Laboratory.

Thissen, D., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Erlbaum.

Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4-69.
Author Contact Info:
Phillip W. Braddy Department of Psychology North Carolina State University Campus Box 7650 Raleigh, NC 27695-7650 Phone: 919-515-2251 Fax: 919-515-1716 E-mail:
[email protected] Adam W. Meade Department of Psychology North Carolina State University Campus Box 7650 Raleigh, NC 27695-7650 Phone: 919-513-4857 Fax: 919-515-1716 E-mail:
[email protected] Emily C. Johnson Department of Psychology North Carolina State University Campus Box 7650 Raleigh, NC 27695-7650 Phone: 919-515-2251 Fax: 919-515-1716 E-mail:
[email protected]

Authors' Note

We would like to thank the Center for Creative Leadership (CCL) for providing us with the archival data set used in this study.
Table 1
"Willingness to Learn from Experiences" Survey Items and Factor Loadings Estimated by EFA

Survey Item                                                                       Factor Loading
1.  Has grown over time                                                                .61
2.  Is able to pull people together around a common goal                               .70
3.  Learns from experience                                                             .72
4.  Can make mid-course corrections                                                    .69
5.  Is not threatened by criticism                                                     .63
6.  Tries very hard to have a positive impact on the business                          .64
7.  Knows how the various parts of the organization fit together                       .54
8.  Pursues feedback even when others are reluctant to give it                         .65
9.  Deals well with failure                                                            .60
10. Is passionate about seeing the business succeed                                    .59
11. Is willing to go against the grain                                                 .48
12. Is not afraid to ask others about his/her impact on them                           .64
13. Is not self-promoting or arrogant                                                  .54
14. Is quick to change his/her behavior to fit with a new environment                  .51
15. Is good at asking insightful questions                                             .65
16. Takes personal as well as business risks                                           .45
17. Seeks experiences that will change his/her perspective                             .61
18. Has a special talent for dealing with people                                       .70
19. Respects people who are different from him/herself                                 .66
20. Is open and candid with other people                                               .66
21. Is not threatened by people who are better at some things than he or she is       .67
Table 2
Items Identified as Having b-Parameter DIF by the LRT in the Male-Female Group Comparison

                                       Females                            Males
Sample Size  Item  G²         a     b1     b2     b3     b4      a     b1     b2     b3     b4
200            5   14.20*    1.95  -2.07  -1.17    .04   1.57   1.35  -2.06  -1.07    .50   1.31
               8   13.10*    1.85  -1.89   -.74    .69   1.84   1.46  -2.40   -.90    .29   2.17
              13   13.60*    1.11  -2.29  -1.56   -.13   1.55   1.51  -2.80  -1.51   -.28   1.21
300           14   15.70**   1.21  -3.21   -.46    .83   2.47   1.03  -3.36    .10   1.42   3.40
              17   12.10*    1.47  -2.96  -1.28    .51   2.80   1.16  -3.63  -1.42    .56   2.36
              21   13.80*    1.89  -2.33  -1.49   -.11   1.54   1.75  -1.88  -1.16    .14   1.47
500            9   11.50*    1.41  -1.95   -.43   1.28   3.44   1.40  -1.85   -.18   1.33   2.87
              14   14.40*    1.09  -3.51   -.34   1.02   2.79    .92  -4.04    .10   1.41   3.08
1000           9   13.90*    1.40  -2.15   -.59   1.07   2.93   1.65  -1.84   -.37   1.16   2.76
              13   22.30***  1.31  -2.14  -1.55   -.14   1.26   1.32  -2.66  -1.84   -.41   1.10
              14   11.50*    1.19  -3.24   -.46    .86   2.49   1.29  -3.10   -.22   1.02   2.62
              19   30.20***  1.83  -2.49  -1.55   -.15   1.40   1.78  -2.82  -1.93   -.34   1.11
              20   12.10*    1.86  -2.22  -1.51   -.08   1.34   1.69  -2.82  -1.57   -.11   1.44
1500           4   15.40**   2.02  -2.99  -1.78   -.13   1.68   1.98  -2.53  -1.60   -.02   1.63
               5   27.80***  1.49  -2.06  -1.15    .25   2.04   1.68  -1.60   -.86    .42   2.01
               9   17.80**   1.32  -2.16   -.48   1.18   3.08   1.56  -1.80   -.28   1.23   2.50
              13   18.20**   1.31  -2.14  -1.52   -.16   1.36   1.27  -2.57  -1.77   -.33   1.18
              14   28.00***  1.03  -3.49   -.47    .97   2.82   1.11  -3.40   -.10   1.23   2.89
              18   12.50*    1.82  -1.59   -.71    .52   1.72   1.88  -1.66   -.86    .44   1.53
              19   44.70***  1.80  -2.61  -1.48   -.06   1.45   1.72  -2.81  -1.91   -.26   1.21
              21   15.00*    1.90  -2.27  -1.37   -.05   1.36   1.80  -2.01  -1.27   -.07   1.37
1800           1   12.10*    1.53  -3.08  -1.70    .01   1.73   1.42  -3.17  -1.82    .09   1.79
               4   12.90*    2.16  -2.75  -1.64   -.02   1.65   1.99  -2.54  -1.58    .00   1.62
               5   25.00***  1.62  -1.91  -1.03    .34   2.01   1.64  -1.64   -.87    .44   2.04
               7   14.50*    1.18  -3.60  -2.33   -.35   1.50   1.28  -3.15  -1.90    .22   1.55
               9   17.50**   1.35  -2.05   -.42   1.23   3.22   1.58  -1.77   -.26   1.25   2.83
              10   14.80*    1.43  -3.37  -2.15   -.57    .99   1.42  -3.66  -2.05   -.42   1.17
              13   32.30***  1.27  -2.18  -1.51   -.11   1.42   1.27  -2.61  -1.76  -1.33   1.14
              14   30.30***  1.14  -3.26   -.39    .94   2.70   1.15  -3.29   -.05   1.70   2.77
              18   16.70**   1.89  -1.53   -.67    .53   1.64   1.93  -1.64   -.84    .45   1.52
              19   50.40***  1.86  -2.48  -1.46   -.03   1.41   1.74  -2.75  -1.87   -.23   1.19
              20   14.90*    1.83  -2.26  -1.48    .02   1.40   1.68  -2.64  -1.74   -.05   1.55
              21   14.60*    1.91  -2.26  -1.30   -.03   1.37   1.82  -2.00  -1.26   -.03   1.39
Note. * p < .05. ** p < .01. *** p < .001.