Arthritis Care & Research Vol. 63, No. 8, August 2011, pp 1159–1169 DOI 10.1002/acr.20491 © 2011, American College of Rheumatology
ORIGINAL ARTICLE
Disease-Related Differential Item Functioning in the Work Instability Scale for Rheumatoid Arthritis: Converging Results From Three Methods
KENNETH TANG AND THE CANADIAN ARTHRITIS NETWORK WORK PRODUCTIVITY GROUP
Objective. The 23-item Work Instability Scale for Rheumatoid Arthritis (RA-WIS) is a promising measure to assess risk for future work disability. Validated in both rheumatoid arthritis (RA) and osteoarthritis (OA), it has high potential for cross-disease applications. Our objective was to examine disease-related differential item functioning (DIF) in the RA-WIS.
Methods. Workers with RA (n = 120) or OA (n = 130) were recruited from 3 sites and completed a questionnaire consisting of demographic and health- and work-related variables, including the RA-WIS (range 0–23, where 23 = highest work instability). Multiple DIF detection methods were applied for comparability: 1) Mantel-Haenszel and Breslow-Day procedures, 2) hierarchical 3-step sequential logistic regression procedure, and 3) a 1-parameter item response theory approach (Rasch analysis). Both tests of significance (chi-square and F tests) and effect size statistics (ΔMH, ΔR²) were assessed to confirm items demonstrating uniform or nonuniform DIF. A 2-step purification procedure was applied to establish a DIF-free conditioning variable (total RA-WIS score) for DIF analyses. The resultant impact of disease-related DIF at the scale level was also evaluated.
Results. All 3 DIF detection methods converged to reveal 3 RA-WIS items as having significant disease-related uniform DIF. Two items (“difficulty opening doors” and “pressure on hand”) were more likely affirmed in RA, while 1 item (“very stiff”) was more likely affirmed in OA. Overall, only a marginal impact at the scale level was found due to a small proportion of scale items exhibiting DIF and the bidirectional nature of DIF effects.
Conclusion. RA-WIS scores can be directly compared between RA and OA without significant concerns for DIF-related measurement bias.
Supported by an unrestricted grant from Abbott and by the Canadian Arthritis Network (part of the Networks of Centres of Excellence) in partnership with The Arthritis Society of Canada. Mr. Tang is recipient of a Canadian Institutes of Health Research PhD Fellowship, Canadian Arthritis Network Graduate Award, and Syme Fellowship from the Institute for Work & Health.
Kenneth Tang, MSc(PT), MSc: University of Toronto, Li Ka Shing Knowledge Institute of St. Michael’s Hospital, and Institute for Work & Health, Toronto, Ontario, Canada.
Address correspondence to Kenneth Tang, MSc(PT), MSc, Mobility Program Clinical Research Unit from the Li Ka Shing Knowledge Institute of St. Michael’s Hospital, 30 Bond Street, Toronto, Ontario, Canada, M5B 1W8. E-mail: [email protected].
Submitted for publication December 10, 2010; accepted in revised form April 19, 2011.
Significance & Innovations
● Diverse methods converged to identify 3 items in the 23-item Work Instability Scale for Rheumatoid Arthritis (RA-WIS) to demonstrate significant disease-related differential item functioning (DIF) between rheumatoid arthritis (RA) and osteoarthritis (OA).
● Effects of DIF were shown to cancel each other at the scale level; therefore, cross-disease comparisons of RA-WIS summed scores between RA and OA may be conducted without significant concerns for measurement bias.
INTRODUCTION
Work disability associated with rheumatoid arthritis (RA) or osteoarthritis (OA) is an important concern in the working population (1–3). Measures that can identify workers experiencing difficulties meeting their job demands are needed to facilitate early interventions, and to help minimize the risk for more adverse outcomes such as permanent work loss. Recent evidence suggests the Work Instability Scale for Rheumatoid Arthritis (RA-WIS) is a promising measure for such a purpose (4,5). The RA-WIS is a self-report multi-item scale that examines a wide range of constructs, including perceptions of performance and stamina at work, issues related to time management, symptom control, as well as cognitive distresses. Difficulties with these constructs contribute to what developers have termed “work instability” (WI), defined as a mismatch between a person’s functional and cognitive abilities in relation to demands at work (4). The RA-WIS has been validated in both RA (4,6) and OA (6,7), which is important for encouraging standardized applications of this outcome across multiple forms of arthritis. Yet, do RA-WIS items function consistently between RA and OA, and are scale scores directly comparable across these 2 forms of arthritis? Because the RA-WIS was originally developed for RA (4), we hypothesize that some RA-WIS items may have less intrinsic relevance for other forms of arthritis, which could bias comparisons of scores across different populations.
Differential item functioning (DIF) analysis can assess whether the probability of item response is systematically linked to respondent characteristics that are unrelated to the concept of the measure itself (8,9). Such analysis could be applied to help inform whether the RA-WIS operates consistently between RA and OA at both the item and scale levels. Evidence of “disease-related” DIF in the RA-WIS could indicate some underlying difference in item relevance or in the way specific items are perceived or interpreted between different arthritis subgroups, and raise concerns for cross-disease measurement bias. If a considerable proportion of scale items exhibit such DIF, this could invalidate direct comparisons of RA-WIS scores between RA and OA. Moreover, this would also suggest the need to identify unique score cut points for each form of arthritis.
Examples of popular DIF detection methods include the Mantel-Haenszel procedure (10), logistic regression (11) and, more recently, approaches based on item response theory (IRT) measurement frameworks. To date, most attention has been given to investigations of DIF associated with age (12,13), sex (12,14,15), language/translations (16–18), or culture (12,19,20), but few studies have examined disease-related DIF (21). Given the high relevance of work disability in both RA and OA and the potential of the RA-WIS for cross-disease applications, the current objective was to assess for disease-related DIF in this measure and its impact on the comparability of scores between RA and OA at the scale level. Multiple DIF detection methods were applied for this investigation to afford an opportunity to examine the comparability of results across different procedures.
PATIENTS AND METHODS
Recruitment and sample size. A total of 250 workers who had been diagnosed with either RA (n = 120) or OA (n = 130) were recruited for a 1-year cohort study from one of the following sites: 2 tertiary-level rheumatology clinics in urban teaching hospitals in Toronto, Ontario, Canada, or an outpatient arthritis treatment program providing multidisciplinary services in Vancouver, British Columbia, Canada. To be included, study participants must have been working for pay at least 1 month prior to recruitment, be able to understand written English, and have provided written consent for study participation. Research ethics board approval for this study has been obtained at all of the participating institutions. At study baseline, all of the participants completed a questionnaire consisting of a series of demographic and health- and work-related variables, including the RA-WIS.
Measure: RA-WIS. The RA-WIS consists of 23 dichotomous items, and aims to quantify the degree of mismatch between an individual’s functional abilities and work demands (4). Participants responded to scale items by affirming (yes = 1) or not affirming (no = 0) specific work-related experiences that would indicate WI. Total scale scores range from 0–23, where higher scores indicate greater WI. In addition to existing evidence supporting its internal consistency, validity, and responsiveness (6,7), its unidimensionality and lack of DIF for age or sex have also been demonstrated from Rasch analyses (4,7). Among our 250 eligible participants, 239 completed the RA-WIS at baseline, with <10% missing entries (i.e., completed ≥21 of 23 items). Only data from these participants were included for the current analysis.
Descriptive statistics. Descriptive statistics were used to provide an overview of the characteristics of study participants. Variables assessed included demographic information (e.g., age, sex, marital status) and health- (e.g., disease duration, Health Assessment Questionnaire [HAQ] [22]) and work-related variables (e.g., occupation type).
Overview of DIF analytic strategy. Our overall approach was to apply 3 different methods to detect 2 types of DIF: uniform and nonuniform. Uniform DIF is a consistent between-group difference in item response probability across the full range of the measured (latent) trait, whereas nonuniform DIF is indicated by a varied between-group difference across the trait (group × trait interaction) (23).
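The distinction between uniform and nonuniform DIF can be illustrated with a small numerical sketch. It assumes a logistic response function with a group term and a group × trait interaction term; the coefficient values and the function names `affirm_probability` and `log_odds_gap` are hypothetical and chosen only for illustration, not taken from the study.

```python
import math

def affirm_probability(trait, group, b0, b1, b2, b3):
    """Probability of affirming an item under an assumed logistic model.

    trait: level of the measured trait (e.g., a total scale score)
    group: 0 or 1 (e.g., two arthritis types)
    b2 is the group effect, b3 the group x trait interaction.
    """
    logit = b0 + b1 * trait + b2 * group + b3 * trait * group
    return 1.0 / (1.0 + math.exp(-logit))

def log_odds_gap(trait, b0, b1, b2, b3):
    """Between-group difference in the log-odds of affirming at one trait level."""
    def log_odds(p):
        return math.log(p / (1.0 - p))
    p1 = affirm_probability(trait, 1, b0, b1, b2, b3)
    p0 = affirm_probability(trait, 0, b0, b1, b2, b3)
    return log_odds(p1) - log_odds(p0)

trait_levels = [0, 5, 10, 15, 20]

# Hypothetical coefficients: group effect only (uniform DIF pattern).
uniform_gaps = [log_odds_gap(t, -3.0, 0.3, 1.0, 0.0) for t in trait_levels]

# Hypothetical coefficients: group effect plus interaction (nonuniform DIF pattern).
nonuniform_gaps = [log_odds_gap(t, -3.0, 0.3, 1.0, -0.1) for t in trait_levels]

print("uniform:", [round(g, 3) for g in uniform_gaps])
print("nonuniform:", [round(g, 3) for g in nonuniform_gaps])
```

With no interaction term the between-group gap in log-odds is the same at every trait level, whereas with an interaction term the gap changes across the trait and, in this made-up example, reverses direction at higher trait levels.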
In this study, both tests of significance and magnitude measures (where available) were used in combination to identify both types of DIF, in accordance with previous recommendations (24,25). Some debate exists regarding the optimal statistical criteria to confirm DIF by the various methods (26,27); therefore, to minimize potential Type I errors, we have applied conservative statistical thresholds and also examined relative magnitudes or “degrees” of DIF among scale items.
Method 1: Mantel-Haenszel and Breslow-Day procedures. The Mantel-Haenszel procedure (10) is a widely used approach that identifies uniform DIF based on analysis of 3-way (2 × 2 × K) contingency tables via cross-tabulation of item response (column) by group (row) for every level of the “conditioning variable” (i.e., level of measured trait), where K represents the number of possible scores for the measure (28). A chi-square test was initially performed to test the null hypothesis that there is no relationship between arthritis type (group) and an item being affirmed (response), controlling for the total RA-WIS score (conditioning variable). If the null hypothesis is rejected (significant chi-square), then disease-related DIF is suggested. The Mantel-Haenszel procedure also provides an effect size statistic known as the “common odds ratio” (αMH) that ranges from 0 to positive infinity, with the value of 1 indicating no DIF. For ease of interpretation, αMH can be transformed into a delta difference (ΔMH) using a natural log formula (ΔMH = −2.35 ln[αMH]) (29) to place it on a scale that centers at 0 (ranges from negative to positive infinity). This statistic provides an indication of both DIF magnitude and direction, where an absolute value of ≥1.5 (|ΔMH| ≥ 1.5) has been proposed as a significant effect size (28). For this study, we applied a threshold of a P value with Bonferroni correction (PBonf) of <0.05 for the chi-square test as an initial screen, and also expected |ΔMH| ≥ 1.5 as a confirmatory test to verify uniform DIF. To complement the Mantel-Haenszel procedure, the Breslow-Day procedure was applied to assess nonuniform DIF (30,31). The Breslow-Day procedure provides a chi-square test that assesses the homogeneity of the αMH across different levels of the conditioning variable. That is, if the difference in the probability of item affirmation between RA and OA varied across the range of total RA-WIS scores (PBonf < 0.05), then nonuniform DIF is indicated. No effect size statistic is currently available for the Breslow-Day procedure.
Method 2: logistic regression (LR). LR is also a common method for detecting DIF (11,32), and a main advantage of this approach is the ability to simultaneously test for both uniform and nonuniform DIF (33). A hierarchical 3-step sequential binary LR modeling process was applied (34):

\[ \ln\left[\frac{P_i}{1 - P_i}\right] = b_0 + b_1\,\mathrm{tot} + b_2\,\mathrm{group} + b_3\,(\mathrm{tot} \times \mathrm{group}) \]

where Pi = probability of affirming item i; b = parameter estimate; tot = total RA-WIS score (conditioning variable); and group = arthritis type. Model 1 enters the conditioning variable only, model 2 adds the group variable, and model 3 adds the interaction term (conditioning × group variable). As an initial screen, we examined the difference in the −2 log likelihood between models 1 and 3 using a chi-square distribution with 2 df (11). If the model fit is better (significant chi-square) in model 3, then some disease-related DIF is suggested for the item. A recommended 1% significance level (35) was applied for this initial screen. We then performed a 2-stage method to verify and differentiate uniform and nonuniform DIF (27,36). The effect size for uniform DIF was assessed by the difference in Nagelkerke’s R² (ΔR²) between models 1 and 2, while the effect size for nonuniform DIF was determined by ΔR² between models 2 and 3 (34). Criteria for negligible (ΔR² < 0.035), moderate (0.035 < ΔR² < 0.070), and large (ΔR² > 0.070) magnitudes of DIF have been proposed (27,37). In this study, we sought a minimum of ΔR² > 0.035 to confirm either form of DIF.
Method 3: 1-parameter IRT approach (Rasch analysis). The final approach was to apply a Rasch analysis to detect DIF, which required an initial fitting of the data to the dichotomous Rasch probabilistic model (38). This model asserts that the probability of a person endorsing an item is
a logistic function of the difference between a person’s ability (level of WI [θ]) and the level of WI represented by the item (b):

\[ p_{ni} = \frac{e^{(\theta_n - b_i)}}{1 + e^{(\theta_n - b_i)}} \]

where pni = the probability that a person n will affirm item i.
Criteria for fit to the Rasch model. Proper fit to the Rasch model and transformation to interval-level scaling require a number of criteria to be satisfied: 1) accordance of data structure to the probabilistic form of Guttman scaling (39), 2) unidimensionality (40), and 3) local independence of items. Statistical criteria to be used to evaluate the fit to the Rasch model have been extensively described in a previous review (41), and are only described briefly here.
Fit of data to the Guttman structure. The item–trait interaction chi-square statistic was used to determine if adequate fit to the Rasch model has been achieved. A customary PBonf > 0.05 threshold was applied to indicate accordance with the Guttman structure. If this was not met, we would perform a sequential removal of the most misfitting item from the model (according to the magnitude of item-fit chi-square statistic) until proper fit is achieved. The Person separation index (PSI) (42) was also assessed to provide an indicator of reliability for the measure, and a minimum of >0.70 was expected.
Tests of unidimensionality and local independence. Tests of dimensionality were undertaken once the above statistical criteria for fit to the Guttman structure had been satisfied. This was assessed by performing a principal component analysis of the residuals to detect signs of multidimensionality within the scale (43). If a scale is unidimensional, no residual associations (factor structure) within the first residual component should exist once the factor for which item associations exist is extracted. To test this formally, we identified all positively (>0) and negatively (<0) loaded items based on the first residual component, and calculated summed scores associated with these subsets. Then, independent t-tests were conducted to compare the person logit estimates derived from these subsets (44).
For a unidimensional scale, the percentage of significant tests (i.e., outside ±1.96) is expected to be less than 5% (or lower bound of associated 95% binomial proportions) (45). Local dependency is defined as consistency among item responses that is unaccounted for by individual differences on the measured construct (43,46). Residual correlation >0.2 between any item pairs would indicate local dependency.

Table 1. Demographic and health- and work-related characteristics of all of the study participants (n = 239)*

Characteristic | RA (n = 112) | OA (n = 127) | Difference, P†
Age | | | <0.0001
No. available | 110 | 124 |
Mean ± SD, years | 46.1 ± 10.6 | 53.8 ± 6.7 |
Range, years | 19–64 | 24–65 |
Sex | | | ns
No. available | 111 | 125 |
Women, no. (%) | 98 (88.3) | 101 (80.8) |
Marital status | | | ns
No. available | 111 | 125 |
Married, % | 50.5 | 59.2 |
Divorced, % | 19.8 | 20.0 |
Widowed, % | 3.6 | 2.4 |
Single, % | 24.3 | 13.6 |
In committed relationship, % | 1.8 | 4.8 |
Education level | | | ns
No. available | 111 | 125 |
High school or less, % | 18.0 | 15.2 |
Some college/university, % | 18.9 | 16.8 |
University/college/technical school graduate, % | 63.1 | 68.0 |
Occupation type‡ | | | 0.02
No. available | 108 | 120 |
Business, finance, administration, % | 40.7 | 45.0 |
Health, science, arts, sports, % | 28.7 | 40.8 |
Sales and services, % | 23.2 | 10.8 |
Trades, transport, equipment operators, % | 7.4 | 3.3 |
General health: SF-1 (range 1–5, 5 = poor) | | | ns
No. available | 112 | 126 |
Mean ± SD | 2.8 ± 0.8 | 2.6 ± 1.0 |
Duration of arthritis | | | ns
No. available | 109 | 123 |
<1 year, % | 8.3 | 10.6 |
1–5 years, % | 29.4 | 41.5 |
>5 years, % | 62.4 | 48.0 |
Self-rated arthritis severity (range 1–7, 1 = very mild and 7 = very severe) | | | ns
No. available | 112 | 127 |
Mean ± SD | 3.0 ± 1.7 | 3.4 ± 1.9 |
HAQ disability (range 0–3, 3 = high disability) | | | ns
No. available | 112 | 127 |
Mean ± SD | 0.7 ± 0.6 | 0.7 ± 0.6 |
Work instability: RA-WIS (range 0–23, 23 = high work instability) | | | ns
No. available | 112 | 127 |
Mean ± SD | 7.9 ± 5.9 | 8.7 ± 6.8 |

* RA = rheumatoid arthritis; OA = osteoarthritis; ns = not statistically significant (P > 0.05); SF-1 = Short Form 1; HAQ = Health Assessment Questionnaire; RA-WIS = Work Instability Scale for Rheumatoid Arthritis.
† Chi-square proportions test or independent t-test applied.
‡ National Occupational Classification developed by Human Resources and Skills Development Canada.

DIF assessment. Item characteristic curves (ICCs) derived from Rasch analysis can provide a graphical illustration of the relationship between the latent trait (i.e., WI) and the probability that respondents with a given level of this trait will affirm an item (28,47). In this study, ICCs were analyzed both visually and statistically to determine the presence of DIF. An item without disease-related DIF should display overlapping ICCs for RA and OA. Uniform DIF would be indicated by a systematic shift between the 2 ICCs across the full spectrum of the latent trait, whereas nonuniform DIF would be indicated by nonparallel ICCs that may intersect over the range of the latent trait. Statistically, DIF was identified by a 2-way analysis of variance of the residuals, where statistical significance in the F test (PBonf < 0.05) for arthritis type (RA versus OA) or the interaction between arthritis type and level of WI would verify the presence of uniform or nonuniform DIF, respectively (48). Effect size (magnitude) statistics have yet to be established for this approach, and therefore, only tests of significance were relied upon to identify items with DIF. Rasch analysis was carried out using RUMM2020 software (49).
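The ICC-based reading of uniform DIF can be sketched numerically with the dichotomous Rasch model. The item locations `b_ra` and `b_oa` below are hypothetical values, and letting the item location differ by group is an illustrative device only; it is not the residual-based analysis of variance performed in RUMM2020.

```python
import math

def rasch_probability(theta, b):
    """Dichotomous Rasch model: probability that a person at location theta
    affirms an item at location b (both in logits)."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def item_characteristic_curve(b, thetas):
    """Affirmation probabilities for one item across a grid of person locations."""
    return [rasch_probability(theta, b) for theta in thetas]

thetas = [-3, -2, -1, 0, 1, 2, 3]

# Hypothetical group-specific item locations: the item is "easier" to affirm in RA.
b_ra, b_oa = -0.5, 0.5

icc_ra = item_characteristic_curve(b_ra, thetas)
icc_oa = item_characteristic_curve(b_oa, thetas)

# Under uniform DIF the two curves are shifted but never cross.
gaps = [p_ra - p_oa for p_ra, p_oa in zip(icc_ra, icc_oa)]
for theta, gap in zip(thetas, gaps):
    print(f"theta={theta:+d}  P(RA) - P(OA) = {gap:.3f}")
```

In this sketch the RA curve lies above the OA curve at every level of the latent trait, which is the pattern described for an item more likely affirmed in RA; curves that crossed would instead suggest nonuniform DIF.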
Purification of the conditioning variable. For all DIF analyses in this study, the recommended 2-step purification procedure (24) was applied to verify that items are not exhibiting “pseudo-DIF,” i.e., an apparent opposing DIF caused by other DIF items (50). By this procedure, an initial run was performed to identify probable DIF candidates. Then, all items were retested with a DIF-free conditioning variable (i.e., a “purified” RA-WIS scale score) except for candidate items, which were examined with the specific item included in the scale score as part of the conditioning variable (32).
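The construction of the purified conditioning variable can be sketched as follows. This is a minimal illustration of the rule as described above, with a hypothetical function name and made-up responses; how candidates are flagged in the initial run is outside the sketch.

```python
def conditioning_score(responses, dif_candidates, studied_item):
    """Summed score used as the conditioning variable when testing one item.

    responses: dict mapping item number -> 0/1 response for one participant
    dif_candidates: set of item numbers flagged as probable DIF in the initial run
    studied_item: item number currently being tested for DIF

    Other DIF candidates are dropped from the score; the studied item itself
    is always retained, even when it is a candidate.
    """
    total = 0
    for item, value in responses.items():
        if item in dif_candidates and item != studied_item:
            continue  # exclude other DIF candidates from the purified score
        total += value
    return total

# Made-up responses for one participant on a handful of items.
responses = {1: 1, 2: 0, 3: 1, 11: 1, 19: 0, 22: 1}
candidates = {11, 19, 22}

score_for_item_1 = conditioning_score(responses, candidates, studied_item=1)
score_for_item_11 = conditioning_score(responses, candidates, studied_item=11)
print(score_for_item_1, score_for_item_11)
```

For a non-candidate item the score is built from DIF-free items only, while for a candidate item the score is the purified score plus that item's own response.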
Table 2. Distribution of response for Work Instability Scale for Rheumatoid Arthritis items, by arthritis type*

Item | RA (n = 112), % Yes | RA, % No | RA, % Missing | OA (n = 127), % Yes | OA, % No | OA, % Missing
1. Slow | 34.8 | 65.2 | 0.0 | 41.7 | 55.9 | 2.4
2. Reduce | 11.6 | 88.4 | 0.0 | 15.8 | 82.7 | 1.6
3. Worried | 26.8 | 73.2 | 0.0 | 29.9 | 69.3 | 0.8
4. Pain | 31.3 | 68.8 | 0.0 | 50.4 | 49.6 | 0.0
5. Stamina | 56.3 | 43.8 | 0.0 | 65.4 | 34.7 | 0.0
6. Holiday | 9.8 | 90.2 | 0.0 | 11.0 | 87.4 | 1.6
7. Push | 45.5 | 54.5 | 0.0 | 54.3 | 45.7 | 0.0
8. Face | 40.2 | 58.9 | 0.9 | 36.2 | 63.8 | 0.0
9. Say no | 39.3 | 60.7 | 0.0 | 42.5 | 56.7 | 0.8
10. Watch | 51.8 | 48.2 | 0.0 | 55.9 | 43.3 | 0.8
11. Open | 37.5 | 62.5 | 0.0 | 18.1 | 81.9 | 0.0
12. Extra | 40.2 | 59.8 | 0.0 | 35.4 | 64.6 | 0.0
13. Frustrated | 18.8 | 81.3 | 0.0 | 29.1 | 70.9 | 0.0
14. Give up | 11.6 | 88.4 | 0.0 | 15.8 | 83.5 | 0.8
15. Get on | 27.7 | 72.3 | 0.0 | 44.9 | 54.3 | 0.8
16. Tired | 44.6 | 55.4 | 0.0 | 45.7 | 54.3 | 0.0
17. Restrict | 18.8 | 81.3 | 0.0 | 23.6 | 75.6 | 0.8
18. Getup | 25.9 | 74.1 | 0.0 | 37.0 | 63.0 | 0.0
19. Stiff | 27.7 | 72.3 | 0.0 | 57.5 | 42.5 | 0.0
20. All manage | 33.0 | 65.2 | 1.8 | 36.2 | 63.8 | 0.0
21. Stress | 35.7 | 64.3 | 0.0 | 31.5 | 67.7 | 0.8
22. Pressure | 49.1 | 50.0 | 0.9 | 30.7 | 69.3 | 0.0
23. Good bad | 69.6 | 30.4 | 0.0 | 65.4 | 33.9 | 0.8

* RA = rheumatoid arthritis; OA = osteoarthritis.
DIF impact. In addition to identifying specific items exhibiting DIF, assessments of the overall impact at the scale level are also important (25). Therefore, if disease-related DIF was evident, we would examine how WI differences between RA and OA subgroups in our current study sample would be influenced if different versions of the RA-WIS (ignoring versus accounting for DIF) were applied.
RESULTS
A summary of the demographic and health-related characteristics of the study sample is provided in Table 1. Only age (P < 0.0001) and occupation type (P = 0.02) differed between arthritis subgroups. The sample mean ± SD of the RA-WIS was 8.3 ± 6.3, and distributions of item-level response are shown in Table 2.
DIF identified by Mantel-Haenszel, Breslow-Day, and LR procedures. The Mantel-Haenszel procedure identified 3 RA-WIS items that met both our statistical criteria (chi-square PBonf < 0.05, |ΔMH| ≥ 1.5) for uniform DIF (Figure 1). Two of these items showed a systematically greater probability of being affirmed in RA (item 11: “difficulty opening doors,” item 22: “pressure on hand”), while the other item was more likely affirmed in OA (item 19: “very stiff”). No items showed significant nonuniform DIF according to Breslow-Day tests. These findings were shown to be reproducible when DIF was assessed by statistics (−2 log likelihood and ΔR²) based on the LR procedure (Figure 2).
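For readers who wish to see how the Mantel-Haenszel statistics behind these criteria are formed, the following is a minimal sketch. It uses the textbook Mantel-Haenszel common odds ratio and continuity-corrected chi-square together with the ΔMH = −2.35 ln(αMH) transformation given in the Methods; the stratum counts are invented, the function name is hypothetical, and the sign of ΔMH depends on which group is entered first, which is an assumption of this sketch rather than a detail reported by the study. Bonferroni correction is not included.

```python
import math

def mantel_haenszel(tables):
    """Mantel-Haenszel statistics from stratified 2 x 2 tables.

    tables: list of (a, b, c, d) counts, one tuple per score stratum, where
      a = group 1 affirmed,  b = group 1 not affirmed,
      c = group 2 affirmed,  d = group 2 not affirmed.
    Returns (common odds ratio, delta, chi-square with continuity correction).
    Raises ZeroDivisionError if the strata carry no usable information.
    """
    numerator = denominator = 0.0
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than 2 people contributes nothing
        numerator += a * d / n
        denominator += b * c / n
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        sum_a += a
        sum_expected += row1 * col1 / n
        sum_variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    alpha = numerator / denominator
    delta = -2.35 * math.log(alpha)
    corrected = max(abs(sum_a - sum_expected) - 0.5, 0.0)
    chi_square = corrected ** 2 / sum_variance
    return alpha, delta, chi_square

# Invented counts for three score strata (not study data).
example_tables = [(10, 10, 5, 15), (12, 8, 6, 14), (15, 5, 9, 11)]
alpha, delta, chi_square = mantel_haenszel(example_tables)
print(f"alpha_MH={alpha:.2f}  delta_MH={delta:.2f}  chi-square={chi_square:.2f}")
```

In this invented example group 1 affirms the item more often in every stratum, so the common odds ratio exceeds 1 and the absolute ΔMH exceeds the 1.5 threshold. Identical response patterns in both groups would give a common odds ratio of 1 and a ΔMH of 0.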
DIF identified by the IRT approach. Initial fitting of RA-WIS data to the Rasch model indicated some deviations to the Guttman pattern (item–trait interaction χ² = 134.9, df 69, P = 0.000004; PSI = 0.90). We first removed 2 misfitting items (items 18 and 20) to improve fit to the Rasch model (item–trait interaction χ² = 101.9, df 63, P = 0.001; PSI = 0.90). Preliminary DIF assessment revealed items 11, 19, and 22 with uniform DIF (Figure 3) and none with nonuniform DIF. Further model modifications were made to examine whether DIF effects would cancel out at the scale level. Testlets (super items) were created for items demonstrating local dependency (residual r > 0.2), which included items 4/19 (r = 0.31), 2/14 (r = 0.30), 9/10 (r = 0.24), 1/5 (r = 0.22), and 12/13 (r = 0.22). Item 15 was then further removed due to poor item fit (fit residual = −3.47, chi-square P = 0.002). Reassessment after these modifications verified DIF for items 11 and 22, and also for the testlet consisting of items 4/19. A final testlet was created to combine items/testlets exhibiting DIF, resulting in a 13-item/testlet model that met criteria for proper fit to the Rasch model (χ² = 64.3, df 39, P = 0.007; PSI = 0.88) that was free of DIF. Unidimensionality was affirmed in this final model, as only 3.7% of independent t-tests comparing person logit estimates (derived from subsets of all positively versus negatively loaded items) were statistically significant. Since 2 of the 3 items exhibiting DIF (items 11 and 22) were related to hand functioning, we further examined whether there was a preexisting difference in the prevalence of “hand involvement” in our sample by comparing the HAQ gripping subscale score (range 0–3, where 3 = most disability) between RA and OA. This was our best available indicator, as we had not collected data on arthritis location or other information specific to hand functioning. HAQ gripping scores were found to differ between arthritis types (χ² = 21.04, P < 0.001), with the majority (62.2%) with OA scoring 0 (no disability), whereas more than half (58.0%) with RA had a score of 2 or 3 (high disability).
Figure 1. Forest plot illustrating the magnitude and direction of the delta difference effect size (ΔMH) of individual Work Instability Scale for Rheumatoid Arthritis (RA-WIS) items using the Mantel-Haenszel differential item functioning (DIF) detection procedure (broken lines show the |ΔMH| ≥ 1.5 threshold). * = items 11, 19, and 22 met both statistical criteria for uniform disease-related DIF (test of significance chi-square P value with Bonferroni correction <0.05 and |ΔMH| ≥ 1.5); RA = rheumatoid arthritis; OA = osteoarthritis.
Impact of DIF. With the original 23-item RA-WIS, the difference in mean ± SD RA-WIS between arthritis subgroups was 0.9 ± 6.4 (OA: 8.7 ± 6.8, RA: 7.9 ± 5.9), or a 3.9% difference. After excluding the 3 items showing disease-related DIF (i.e., 20-item RA-WIS), both groups had lower overall mean ± SD scores (OA: 7.7 ± 6.0, RA: 6.7 ± 5.2), but the relative difference between subgroups (0.9 ± 5.6) remained relatively similar. This represented only a marginally larger difference (4.5%) in the modified scale range (0–20).
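The percentages quoted for the scale-level impact are consistent with expressing the between-group mean difference as a share of the scale range. That reading is an inference from the reported numbers rather than a formula stated in the text, and the helper name below is hypothetical.

```python
def percent_of_range(mean_difference, scale_max, scale_min=0):
    """Express a between-group mean difference as a percentage of the scale range."""
    return 100.0 * mean_difference / (scale_max - scale_min)

# Reported between-group difference of 0.9 points on each version of the scale.
original_scale = percent_of_range(0.9, 23)  # 23-item RA-WIS, range 0-23
modified_scale = percent_of_range(0.9, 20)  # 20-item version, range 0-20

print(round(original_scale, 1), round(modified_scale, 1))
```

Under this reading, the same 0.9-point difference occupies a slightly larger share of the shorter 20-item scale, which matches the direction of the change reported above.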
DISCUSSION
The ability to apply the same work disability outcomes to different populations can be useful from the perspective of comparability; however, there is little current evidence to suggest that outcome scores derived from such measures have direct comparability when applied to different populations. Ideally, a given scale score should consistently represent the same level of the underlying trait, but this assumption could be threatened if items do not function consistently (i.e., have bias) across clinical populations. The RA-WIS is a measure that has been independently validated in both RA and OA, and an examination of potential DIF associated with specific forms of arthritis is of interest. This study found evidence of disease-related DIF in the RA-WIS at the item level. Results from 3 different DIF detection methods converged to reveal the same 3 items to show significant uniform DIF, and 2 other items nearing the detection thresholds (Table 3); however, it was also determined that such DIF ultimately had only a minimal impact on the comparability of RA-WIS scores between RA and OA at the scale level. This was likely attributable to the fact that only a small proportion (13% [3 of 23]) of scale items showed significant DIF, and also because such DIF was bidirectional; therefore, much of the effects had “cancelled” each other at the scale level. Given the minimal overall impact, we believe the direct comparison of RA-WIS scores between RA and OA is appropriate, and that caution need be reserved only for cross-disease comparisons made at the item level.
Figure 2. Forest plot illustrating the magnitude and direction of the model difference Nagelkerke’s R² (ΔR²) effect size statistic to detect uniform and nonuniform differential item functioning (DIF) using the hierarchical 3-stage logistic regression procedure (broken lines show the ΔR² > 0.035 threshold). * = items 11, 19, and 22 met statistical criteria for uniform disease-related DIF (ΔR² > 0.035 for difference between logistic regression models 1 and 2); RA-WIS = Work Instability Scale for Rheumatoid Arthritis; RA = rheumatoid arthritis; OA = osteoarthritis.
Figure 3. Item characteristic curves of 2 Work Instability Scale for Rheumatoid Arthritis items demonstrating significant disease-related uniform differential item functioning as detected by a 1-parameter item response theory approach (Rasch analysis). A, Item 11, “I have great difficulty opening some of the doors at work” (workers with rheumatoid arthritis [RA] have a greater probability of affirming the item at all levels of work instability [WI]). B, Item 19, “I get very stiff at work” (workers with osteoarthritis [OA] have a greater probability of affirming the item at all levels of WI).
What specific underlying factors might account for the observed DIF in the RA-WIS? It seems probable that disease-related DIF for items 11 (“difficulty opening doors”) and 22 (“pressure on hand”) could be directly related to subgroup differences in hand involvement, given significant differences in HAQ gripping disability between RA and OA observed in our cohort. Such DIF could be relevant whenever there is a significant imbalance in the extent of hand involvement between arthritis subgroups being compared. This illustrates an inherent challenge for items with high anatomic specificity, which may not have equal relevance to different arthritis types. One other potential implication to consider is whether such measurement bias could extend beyond direct comparisons between just RA and OA. Presumably, direct comparisons between any 2 cohorts where the extent of hand involvement is significantly unbalanced might pose similar challenges related to these 2 specific RA-WIS items. Users should be aware of the potential for measurement bias for these RA-WIS items due to their anatomic specificity.
The observation that items 19 (“very stiff”) and 4 (“pain or stiffness”) met or closely approximated the statistical
threshold for DIF suggests that “stiffness” is likely the specific biased element, since it is common to both items. The opposing direction of DIF for these items (i.e., more likely affirmed in OA) was somewhat surprising. Since the RA-WIS was originally designed for RA, we had anticipated that few, if any, items would show such a strong DIF effect toward OA. A possible explanation is that while morning stiffness is prevalent in RA, it may have less impact during typical work hours in the daytime. An additional factor to consider is the potential influence of work context. Perhaps more workers with OA in our sample were simply working at jobs where prolonged sitting or standing is required, thus increasing their propensity to experience certain clinical signs (i.e., stiffness) compared to those with RA. We did find a subgroup difference in terms of occupation type, although the classification sys-
Differential Item Functioning in the RA-WIS
1167
Table 3. Summary of DIF detection methods and statistical criteria applied and study findings* Method 1: Mantel-Haenszel/ Breslow-Day procedures
PBonf ⬍ 0.05 for ANOVA F test
Items 11, 19, 22 Item 4 (chi-square P ⫽ 0.01) Item 15 (chi-square P ⫽ 0.02) None
Items 11, 19, 22 Item 4 (⌬R2 ⫽ 0.029) Item 15 (⌬R2 ⫽ 0.027)
Items 11, 19, 22 Item 4 (P ⫽ 0.01) Item 15 (P ⫽ 0.002)
None
None
Item 4 (chi-square P ⫽ 0.01)
Item 4 (⌬R2 ⫽ 0.033) Item 6 (⌬R2 ⫽ 0.031)
Item 4 (P ⫽ 0.04) Item 12 (P ⫽ 0.03)
PBonf ⬍ 0.05 for chi-square (1 df) test
Step 2 (confirmatory test): effect size statistic‡
⌬MH ⱖⱍ1.5ⱍ
Nonuniform DIF: met threshold Nonuniform DIF: near threshold
Method 3: 1-parameter IRT (Rasch analysis)†
P ⬍ 0.01 for chi-square test (2 df) of difference of ⫺2 log likelihood between models 1 and 3 Nagelkerke’s ⌬R2 ⫽ ⱖ0.035 (uniform DIF: model 2 minus 1, nonuniform DIF: model 3 minus 2)
Step 1 (initial screen): test of significance‡
Results Uniform DIF: met threshold Uniform DIF: near threshold
Method 2: logistic regression
None available
* DIF ⫽ differential item functioning; IRT ⫽ item response theory; PBonf ⫽ P value with Bonferroni correction; ANOVA ⫽ analysis of variance. † Reflects results from the preliminary DIF assessment (i.e., prior to model modifications during Rasch analysis). ‡ For this study, we confirmed DIF (uniform or nonuniform) only if an item met both the test of significance and the effect size statistical criteria.
tem applied was too broad to be informative on differences in the specific nature of the job requirements. Nonetheless, observed disease-related DIF for these items suggests that stiffness at work could be an experience that may have greater intrinsic relevance for workers with OA.
Precise scoring estimation of outcomes is fundamental to proper interpretation of results. While the current analysis suggests that disease-related DIF ultimately had little overall impact at the scale level, for future comparisons of WI between RA and OA it may still be worthwhile to consider potential strategies to account for DIF for the 3 most relevant items, where possible. One strategy may be to perform a sensitivity analysis with a DIF-free version of the RA-WIS (i.e., 20 items) to confirm subgroup differences in WI where such biases could be a concern. A second option may be to explore item-splitting approaches to "adjust" for DIF (19,21) in order to establish disease-specific parameters for individual items to facilitate cross-disease comparisons. The increased complexity of scoring the RA-WIS with such an approach, however, is a potential tradeoff that must be considered, especially from the perspective of clinical practicality.
Our relatively small sample size is a study limitation to be considered, as it is below the typical recommendation of n > 200 per comparison group for DIF assessments. However, >100 per group has been considered adequate for binary LR and 1-parameter IRT methods (35). Moreover, we believe converging findings from multiple well-established DIF detection methods provided additional strength to our results, in addition to the fact that we applied conservative statistical criteria to help minimize potential Type I error. Other methodologic strengths to be considered were the use of both parametric and nonparametric approaches and the application of both summed and latent trait scores (i.e., IRT) as the conditioning variable to provide diverse perspectives in our DIF assessments.
We conclude that although 3 RA-WIS items showed disease-related DIF, this had a negligible resultant impact on the comparability of scores at the scale level. This suggests that, ultimately, RA-WIS scores can be directly compared between RA and OA without significant concerns for DIF-related measurement bias. It is important to reiterate that the RA-WIS was originally intended as a disease-specific measure, with items specifically developed for RA, and therefore evidence of some disease-related DIF at the item level was not unexpected. In fact, we believe the relatively small proportion of items exhibiting DIF is indicative of the strong resonance of the overall concept of WI to workers with other forms of arthritis such as OA, where work disability is also an important concern. With ongoing interest in applying work-specific measures in a broad range of other rheumatic conditions, item biases are expected to be increasingly important to consider, and similar work in the future will be useful to ensure that cross-disease comparisons of outcomes are appropriately conducted.
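The two strategies suggested above for the 3 DIF items (a 20-item sensitivity analysis and item splitting) can be illustrated with a minimal Python sketch. This is not the procedure used in the study. The item positions in `DIF_ITEMS`, the function names, and the data layout (one list of 0/1 responses per worker) are assumptions made for illustration only. A real sensitivity analysis would also rerun the subgroup comparison test itself, not just compare raw mean differences.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence

# ASSUMPTION: 0-based positions of the 3 DIF items. These indices are
# placeholders; substitute the actual RA-WIS item positions.
DIF_ITEMS = (4, 11, 17)


def total_score(responses: Sequence[int], exclude: Sequence[int] = ()) -> int:
    """Count affirmed items (1 = yes, 0 = no), skipping excluded positions."""
    skip = set(exclude)
    return sum(r for i, r in enumerate(responses) if i not in skip)


def sensitivity_analysis(
    ra: Sequence[Sequence[int]],
    oa: Sequence[Sequence[int]],
    dif_items: Sequence[int] = DIF_ITEMS,
) -> Dict[str, float]:
    """RA-minus-OA mean score difference on the full and DIF-free scales."""

    def diff(exclude: Sequence[int]) -> float:
        return (mean(total_score(r, exclude) for r in ra)
                - mean(total_score(r, exclude) for r in oa))

    return {"full_23": diff(()), "dif_free_20": diff(dif_items)}


def split_item(
    responses: Sequence[int],
    group: str,
    item: int,
    groups: Sequence[str] = ("RA", "OA"),
) -> List[Optional[int]]:
    """Item splitting: drop one item and append one copy per disease group.

    Only the copy for the respondent's own group keeps the response; the
    other copy is None (structurally missing), so a later calibration can
    estimate a separate item parameter for each group.
    """
    rest = [r for i, r in enumerate(responses) if i != item]
    copies = [responses[item] if group == g else None for g in groups]
    return rest + copies
```

If the two mean differences returned by `sensitivity_analysis` lead to the same substantive conclusion, the DIF items are unlikely to be driving the observed subgroup difference. The split-item layout trades that simplicity for disease-specific item parameters, which is the added scoring complexity noted above.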
ACKNOWLEDGMENTS The authors would like to acknowledge the participating institutions where data for the current study objective were collected: Mount Sinai Hospital (Toronto, Ontario, Canada), the Martin Family Centre for Arthritis Care and Research at St. Michael’s Hospital (Toronto, Ontario, Canada), and the Mary Pack Arthritis Program (Vancouver, British Columbia, Canada). The authors would also like to thank the Institute for Work & Health and the Arthritis Community Research & Evaluation Unit for providing in-kind support. Finally, the authors wish to acknowledge individuals who have made contributions to the overall Canadian Arthritis Network Work Productivity project: Xingshan Cao (data analyst), Paul Clarke (research coordinator), Timea Donka (research assistant), Rebecca Dubé (research assistant), Katherine Edwards (research assistant), Taucha Inrig (research assistant), Carol Kennedy (research assistant), Jessica Lee (research coordinator), Xin Li (postdoctoral fellow), Samra Mian (research assistant), Ludmila Mironyuk (research coordinator), Anusha Raj (research associate), Pam Rogers (research coordinator), Rebeka Sujic (research coordinator), Debbie Sutton (data analyst), Ada Todd (research coordinator), Dwayne Van Eerd (research coordinator), Rebecca Wickett (research coordinator), Jessica Widdifield (research coordinator), and Wei Zhang (graduate student). Investigators of the Canadian Arthritis Network Work Productivity Group are as follows: Dr. Dorcas E. Beaton (principal investigator), Dr. Claire Bombardier (principal investigator), Dr. Aslam H. Anis (coinvestigator), Dr. Elizabeth M. Badley (coinvestigator), Dr. Monique A. M. Gignac (coinvestigator), and Dr. Diane Lacaille (coinvestigator).
AUTHOR CONTRIBUTIONS All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Mr. Tang had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. Study conception and design. Tang. Analysis and interpretation of data. Tang.
ROLE OF THE STUDY SPONSOR The authors declare that Abbott had no direct role in the study design, data collection, analysis and interpretation of the data, writing of the manuscript, approval of manuscript content, or decision to publish this work. Neither the submission nor the publication of this article was contingent on the approval of Abbott.
REFERENCES
1. Backman CL. Employment and work disability in rheumatoid arthritis. Curr Opin Rheumatol 2004;16:148–52.
2. Badley EM, Wang PP. The contribution of arthritis and arthritis disability to nonparticipation in the labor force: a Canadian example. J Rheumatol 2001;28:1077–82.
3. Yelin E, Meenan R, Nevitt M, Epstein W. Work disability in rheumatoid arthritis: effects of disease, social, and work factors. Ann Intern Med 1980;93:551–6.
4. Gilworth G, Chamberlain MA, Harvey A, Woodhouse A, Smith J, Smyth MG, et al. Development of a work instability scale for rheumatoid arthritis. Arthritis Rheum 2003;49:349–54.
5. Tang K, Beaton DE, Gignac MA, Lacaille D, Zhang W, Bombardier C, and the Canadian Arthritis Network Work Productivity Group. The Work Instability Scale for Rheumatoid Arthritis predicts arthritis-related work transitions within 12 months. Arthritis Care Res (Hoboken) 2010;62:1578–87.
6. Beaton DE, Tang K, Gignac MA, Lacaille D, Badley EM, Anis AH, et al. Reliability, validity, and responsiveness of five at-work productivity measures in patients with rheumatoid arthritis or osteoarthritis. Arthritis Care Res (Hoboken) 2010;62:28–37.
7. Tang K, Beaton DE, Lacaille D, Gignac MA, Zhang W, Anis AH, et al. The Work Instability Scale for Rheumatoid Arthritis (RA-WIS): does it work in osteoarthritis? Qual Life Res 2010;19:1057–68.
8. Holland PW, Wainer H. Differential item functioning. Hillsdale (NJ): Lawrence Erlbaum; 1993.
9. Camilli G. Test fairness. In: Brennan RL, editor. Educational measurement. 4th ed. Westport (CT): American Council on Education; 2006. p. 220–56.
10. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959;22:719–48.
11. Swaminathan H, Rogers HJ. Detecting differential item functioning using logistic regression procedures. J Educ Meas 1990;27:361–70.
12. Cole SR, Kawachi I, Maller SJ, Berkman LF. Test of item-response bias in the CES-D scale: experience from the New Haven EPESE study. J Clin Epidemiol 2000;53:285–9.
13. Niti M, Ng TP, Chiam PC, Kua EH. Item response bias was present in instrumental activity of daily living scale in Asian older adults. J Clin Epidemiol 2007;60:366–74.
14. Borsboom D. When does measurement invariance matter? Med Care 2006;44 Suppl:S176–81.
15. Bjorner JB, Kosinski M, Ware JE Jr. Calibration of an item pool for assessing the burden of headaches: an application of item response theory to the headache impact test (HIT). Qual Life Res 2003;12:913–33.
16. Petersen MA, Groenvold M, Bjorner JB, Aaronson N, Conroy T, Cull A, et al. Use of differential item functioning analysis to assess the equivalence of translations of a questionnaire. Qual Life Res 2003;12:373–85.
17. Scott NW, Fayers PM, Aaronson NK, Bottomley A, de Graeff A, Groenvold M, et al. The use of differential item functioning analyses to identify cultural differences in responses to the EORTC QLQ-C30. Qual Life Res 2007;16:115–29.
18. Martin M, Blaisdell B, Kwong JW, Bjorner JB. The Short-Form Headache Impact Test (HIT-6) was psychometrically equivalent in nine languages. J Clin Epidemiol 2004;57:1271–8.
19. Tennant A, Penta M, Tesio L, Grimby G, Thonnard JL, Slade A, et al. Assessing and adjusting for cross-cultural validity of impairment and activity limitation scales through differential item functioning within the framework of the Rasch model: the PRO-ESOR project. Med Care 2004;42 Suppl:I37–48.
20. Schmidt S, Debensason D, Muhlan H, Petersen C, Power M, Simeoni MC, et al. The DISABKIDS generic quality of life instrument showed cross-cultural validity. J Clin Epidemiol 2006;59:587–98.
21. Lundgren-Nilsson A, Tennant A, Grimby G, Sunnerhagen KS. Cross-diagnostic validity in a generic instrument: an example from the Functional Independence Measure in Scandinavia. Health Qual Life Outcomes 2006;4:55.
22. Fries JF, Spitz P, Kraines RG, Holman HR. Measurement of patient outcome in arthritis. Arthritis Rheum 1980;23:137–45.
23. Millsap RE, Everson HT. Methodology review: statistical approaches for assessing measurement bias. Appl Psych Meas 1993;17:297–334.
24. Hambleton RK. Good practices for identifying differential item functioning. Med Care 2006;44 Suppl:S182–8.
25. Teresi JA, Fleishman JA. Differential item functioning and health assessment. Qual Life Res 2007;16 Suppl:33–42.
26. Crane PK, Gibbons LE, Ocepek-Welikson K, Cook K, Cella D, Narasimhalu K, et al. A comparison of three sets of criteria for determining the presence of differential item functioning using ordinal logistic regression. Qual Life Res 2007;16 Suppl:69–84.
27. Jodoin MG, Gierl MJ. Evaluating Type I error and power rates using an effect size measure with logistic regression procedures for DIF detections. Appl Meas Educ 2001;14:329–49.
28. Dorans NJ, Holland PW. DIF detection and description: Mantel-Haenszel and standardisation. In: Holland P, Wainer H, editors. Differential item functioning: theory and practice. Hillsdale (NJ): Lawrence Erlbaum Associates; 1993. p. 36–66.
29. Holland PW, Thayer DT. An alternative definition of the ETS delta scale of item difficulty (research report 85-43). Princeton (NJ): Educational Testing Service; 1985.
30. Penfield RD. Application of the Breslow-Day test of trend in odds ratio heterogeneity to the detection of nonuniform DIF. Alberta J Educ Res 2003;49:231–43.
31. Breslow NE, Day NE. Statistical methods in cancer research. Vol. I. The analysis of case-control studies. Lyon: International Agency for Research on Cancer; 1980.
32. Clauser BE, Mazor KM. Using statistical procedures to identify differentially functioning test items. Educ Meas Issues Pract 1998;2:31–44.
33. Swaminathan H. Differential item functioning: a discussion. In: Laveault D, Zumbo BD, Gessaroli ME, Boss MW, editors. Modern theories of measurement: problems and issues. Ottawa: University of Ottawa; 1994. p. 171–80.
34. Zumbo BD. A handbook on the theory and methods of differential item functioning (DIF): logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa: Directorate of Human Resources Research and Evaluation, Department of National Defense; 1999.
35. Lai JS, Teresi J, Gershon R. Procedures for the analysis of differential item functioning (DIF) for small sample sizes. Eval Health Prof 2005;28:283–94.
36. Camilli G, Shepard LA. Methods for identifying biased test items. Thousand Oaks (CA): Sage; 1994.
37. Hidalgo MH, Lopez-Pina JA. Differential item functioning detection and effect size: a comparison between logistic regression and Mantel-Haenszel procedures. Educ Psychol Meas 2004;64:903–15.
38. Rasch G. Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press; 1960.
39. Guttman LA. The basis of Scalogram analysis. In: Stouffer SA, Guttman LA, Suchman FA, Lazarsfeld PF, Star SA, Clausen JA, editors. Studies in social psychology in World War II. Vol. 4. Measurement and prediction. Princeton: Princeton University Press; 1950. p. 60–90.
40. Svensson E. Guidelines to statistical evaluation of data from rating scales and questionnaires. J Rehabil Med 2001;33:47–8.
41. Tennant A, Conaghan PG. The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied, and what should one look for in a Rasch paper? Arthritis Rheum 2007;57:1358–62.
42. Fisher WJ. Reliability statistics. Rasch Meas Trans 1992;6:238.
43. Wright BD. Local dependency, correlations and principal components. Rasch Meas Trans 1996;10:509–11.
44. Smith EV. Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. J Appl Meas 2002;3:205–31.
45. Tennant A, Pallant JF. Unidimensionality matters. Rasch Meas Trans 2006;20:1048–51.
46. Steinberg L, Thissen D. Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychol Methods 1996;1:81–97.
47. Smith RM. Fit analysis in latent trait measurement models. J Appl Meas 2000;1:199–218.
48. Hagquist C, Andrich D. Is the sense of coherence-instrument applicable on adolescents? A latent trait analysis using Rasch modelling. Pers Indiv Differ 2004;36:955–68.
49. Andrich D, Lyne A, Sheridan B, Luo G. RUMM 2020. Perth: RUMM Laboratory; 2003.
50. Groenvold M, Petersen MA. The role of use of differential item functioning (DIF) analysis of quality of life data from clinical trials. In: Fayers PM, Hays RD, editors. Assessing quality of life in clinical trials. Oxford: Oxford University Press; 2005. p. 195–208.