ORIGINAL ARTICLE
Practical Issues in the Application of Item Response Theory
A Demonstration Using Items From the Pediatric Quality of Life Inventory (PedsQL) 4.0 Generic Core Scales
Cheryl D. Hill, PhD,* Michael C. Edwards, PhD,† David Thissen, PhD,‡ Michelle M. Langer, MA,‡ R. J. Wirth, MA,‡ Tasha M. Burwinkle, PhD,§ and James W. Varni, PhD¶‖
Background: Item response theory (IRT) is increasingly being applied to health-related quality of life instrument development and refinement. This article discusses results obtained using categorical confirmatory factor analysis (CCFA) to check IRT model assumptions and the application of IRT in item analysis and scale evaluation. Objectives: To demonstrate the value of CCFA and IRT in examining a health-related quality of life measure in children and adolescents. Methods: This illustration uses data from 10,241 children and their parents on items from the 4 subscales of the PedsQL 4.0 Generic Core Scales. CCFA was applied to confirm domain dimensionality and identify possible locally dependent items. IRT was used to assess the strength of the relationship between the items and the constructs of interest and the information available across the latent construct. Results: CCFA showed generally strong support for 1-factor models for each domain; however, several items exhibited evidence of local dependence. IRT revealed that the items generally exhibit favorable characteristics and are related to the same construct within a given domain. We discuss the lessons that can be learned by comparing alternate forms of the same scale, and we assess the potential impact of local dependence on the item parameter estimates. Conclusions: This article describes CCFA methods for checking IRT model assumptions and provides suggestions for using these methods in practice. It offers insight into ways information gained through IRT can be applied to evaluate items and aid in scale construction.
From the *RTI Health Solutions, Research Triangle Park, North Carolina; †Department of Psychology, The Ohio State University, Columbus; ‡Department of Psychology, University of North Carolina, Chapel Hill; §Department of Pediatrics, Texas A&M University College of Medicine, Temple; ¶Department of Pediatrics, College of Medicine; and ‖Department of Landscape Architecture and Urban Planning, College of Architecture, Texas A&M University, College Station. Supported by National Institutes of Health Grant 1U01AR052181-01. Presented at the annual meeting of the International Society for Quality of Life Research on October 20, 2005 in San Francisco, CA. Reprints: Cheryl D. Hill, PhD, RTI Health Solutions, 200 Park Offices Drive, P.O. Box 12194, Research Triangle Park, NC 27709-2194. E-mail:
[email protected]. Copyright © 2007 by Lippincott Williams & Wilkins ISSN: 0025-7079/07/4500-0039
Medical Care • Volume 45, Number 5 Suppl 1, May 2007
Key Words: IRT, factor analysis, instrument development, PedsQL (Med Care 2007;45: S39–S47)
The Patient-Reported Outcomes Measurement Information System (PROMIS) project aims to assemble health-related quality of life (HRQoL) item banks for developing both adaptive (ie, computerized adaptive testing [CAT]) and nonadaptive (ie, linear) patient-reported outcomes instruments.1 This process of item banking and test assembly relies heavily on item response theory (IRT) to assess the properties of the candidate items that inform the assignment of items to domain banks and the selection of appropriate items for instruments. As with any model, the use of IRT implies a number of assumptions about the data.2 This article discusses methods that can be used to check 2 primary assumptions of many IRT models, unidimensionality and local independence. In presenting these methods, we work through an example using data on items from an existing HRQoL instrument; these items were considered for inclusion in the PROMIS item bank and were also used to inform the development of new items for use with PROMIS.

IRT models describe the probability of observing a particular pattern of responses given the respondent’s level on the underlying construct (θ). With the 2-parameter logistic (2PL) model, which is appropriate for items measured in 2 response categories (eg, yes/no, true/false), this probability is modeled using a slope parameter (a_i) and a location parameter (b_i) for each item i. The slope parameter measures the strength of the relationship between the item and the underlying construct; higher slopes mean that the item can discriminate more sharply between respondents above and below some level on the latent continuum. For dichotomous items, the location parameter is the point along the latent continuum at which the item is most discriminating or informative; a respondent whose level on the underlying construct is at this location has a 50% chance of endorsing the item.
In fields such as educational measurement, the location parameter is known as the difficulty parameter, where higher values are associated with more difficult items (ie, the respondent must be higher on the latent trait to provide a correct response).
Hill et al
The probability of endorsing an item is described by a function of these item parameters called a trace line, or item characteristic curve, which takes the form for the 2PL model of:

$$T(u_i = 1 \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]}, \tag{1}$$

where u_i = 1 refers to a positive response to item i.3 An alternative model often used in health outcomes research is Samejima’s graded response model (GRM),4,5 which generalizes the 2PL model to include multiple b_ij parameters per item (j from 1 to m − 1) to correspond to m response categories (eg, items with the response scale “Strongly Disagree,” “Disagree,” “Neutral,” “Agree,” and “Strongly Agree”). The formula for a GRM trace line is:

$$T(u_i = j \mid \theta) = \frac{1}{1 + \exp[-a_i(\theta - b_{ij})]} - \frac{1}{1 + \exp[-a_i(\theta - b_{i,j+1})]}, \tag{2}$$

which states that the probability of responding in category j is the difference between a 2PL trace line for the probability of responding in category j or higher and a 2PL trace line for the probability of responding in category j + 1 or higher. In the case of the GRM, a respondent with an underlying construct value of b_ij has an equal probability of choosing category j or lower and category j + 1 or higher. These trace lines can be plotted as the probability of endorsement along the continuum of the latent trait to provide a visual representation of location and discrimination. An expected score plot is an alternative to a trace line plot that collapses the lines for each category into 1 trajectory, showing the expected response score across the latent trait. Trace lines can also be used to calculate information curves that display the amount of information an item provides along the continuum of the latent trait.

In health outcomes research, items are often scored so that higher scores indicate that the respondent is higher on the scale of the latent construct, or that the individual possesses more of the trait that the items are designed to measure. For example, a scale designed to assess quality of life would be scored so that higher scores correspond with higher quality of life. This would also mean that categories with larger b_ij parameters would be more likely to be endorsed by respondents with better quality of life than those with worse quality of life.

Because of the way IRT models combine information across items, 2 primary data requirements must be met. First, the scale must be unidimensional, that is, the pattern of item responses is best described by 1 dominant construct. When items that are related to multiple underlying constructs are forced to provide information for 1 construct alone, it is difficult to determine what construct is being represented in the ensuing scale score.
Second, the items must be locally independent, which means that the probabilities of each item
response are related only through the value of the latent variable. That is, after accounting for the respondent’s latent variable value, there should be no relationship between the responses to different items. Items that do have a relationship apart from the latent variable can create their own second dimension that explains covariance between these items that is not shared with the other items on the scale. This becomes a specific factor that is common to the locally dependent items and is separate from the general factor common to all items on the scale. When this multidimensional scale is forced into a unidimensional model, if the locally dependent items are strongly defined (ie, high-factor loadings) and the remaining items are weakly defined (ie, low-factor loadings), the strength of the relationship between the locally dependent items can change the construct measured by a scale by causing the 1 factor to be a measure of the specific factor rather than the general factor of interest.6 The goal of this article is to outline the use of categorical confirmatory factor analysis (CCFA) and IRT in item selection and scale development as applied in the PROMIS project, using examples from data obtained from items on the PedsQL 4.0 Generic Core Scales.7 CCFA, a factor analytic approach that accounts for the non-normality of categorical data that renders traditional confirmatory factor analysis methods inappropriate, will be used to assess domain dimensionality and to identify possible locally dependent items. Although there are other approaches for assessing local dependence and dimensionality, the use of CCFA was supported by the PROMIS psychometric team.8 IRT will be used to assess how well the items measure the construct of interest and the appropriateness of the set of items for various ranges on the latent construct.
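The trace lines in equations 1 and 2, and the expected score derived from them, can be sketched in a few lines of Python. This is an illustration only, not the software used in the article: the function names are hypothetical, categories are assumed to be scored 0 to m − 1 so that the thresholds b_i1, ..., b_i(m−1) separate adjacent categories, and any parameter values passed in are made up for the example.

```python
import math


def trace_2pl(theta, a, b):
    """Equation 1: probability of a positive response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def trace_grm(theta, a, thresholds):
    """Equation 2: category probabilities under the graded response model.

    `thresholds` holds b_i1 < ... < b_i(m-1). Categories are indexed
    0..m-1, which is an indexing assumption made for this sketch.
    """
    # Cumulative curves P(response >= j): 1 for the lowest category,
    # a 2PL curve for each threshold, and 0 beyond the top category.
    cumulative = [1.0] + [trace_2pl(theta, a, b) for b in thresholds] + [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[j] - cumulative[j + 1] for j in range(len(thresholds) + 1)]


def expected_score(theta, a, thresholds):
    """Expected item score at theta: sum of category score times probability."""
    return sum(j * p for j, p in enumerate(trace_grm(theta, a, thresholds)))


if __name__ == "__main__":
    # Hypothetical 5-category item (slope and thresholds are illustrative).
    a, b = 2.0, [-0.5, 0.5, 1.5, 2.5]
    for theta in (-2.0, 0.0, 2.0):
        probs = [round(p, 3) for p in trace_grm(theta, a, b)]
        print(theta, probs, round(expected_score(theta, a, b), 3))
```

Evaluating these functions over a grid of θ values gives the points needed for a trace line plot or an expected score plot of the kind described above.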
METHODS

The use of CCFA and IRT will be demonstrated using data on items from the 4 subscales of the PedsQL 4.0 Generic Core Scales.7 This instrument consists of 23 items designed to measure HRQoL in children and adolescents. Four domains are assessed: (1) Physical Functioning, (2) Emotional Functioning, (3) Social Functioning, and (4) School Functioning. A number of instrument versions exist for various age ranges, different informants, and assorted languages; however, this example will focus only on the child self-report and parent proxy-report for children (ages 8–12 years old) and adolescents (ages 13–18 years old) in English and Spanish. All analyses considered informant (self or parent), age (child or adolescent), and language (English or Spanish) separately, for a total of 8 replications of the analysis for each domain.

Items from the PedsQL 4.0 Generic Core Scales were examined during the initial stages of the PROMIS project to obtain information about the dimensionality of the domains of interest to the project, to provide preliminary information about some items being considered for inclusion in the PROMIS item bank, and to familiarize the research team with the analysis plan that will be applied to PROMIS data when they become available. Thus, this analysis is not intended to be an evaluation of the PedsQL 4.0 Generic Core Scales, but rather an examination of individual items under consideration for use with PROMIS.

Items are scored on a 5-point response scale (0 = “never a problem,” 1 = “almost never a problem,” 2 = “sometimes a problem,” 3 = “often a problem,” 4 = “almost always a problem”). Because of the natural direction of the response scale, models were fit with higher scores indicating higher severity. (It is important to note that the PedsQL 4.0 Generic Core Scales scoring instructions have the items reverse-scored and modeled with higher scores indicating higher quality of life).7

Whereas the PROMIS project has an elaborate plan for selecting samples on which to calibrate item parameters that are appropriate for measurement in the intended populations, this example used a convenience sample obtained through the California State Children’s Health Insurance Program, consisting primarily of healthy children. The PedsQL 4.0 Generic Core Scales were mailed to 20,031 English, Spanish, Vietnamese, Korean, or Cantonese speaking families in California with children between 2–16 years old who enrolled in California State Children’s Health Insurance Program during the months of February and March, 2001.9 The number of returned surveys was 10,241, and only data from the families who spoke English or Spanish with children 8 years of age and older were considered for these analyses. Additional details about the sample and its characteristics can be found in a study by Varni et al.9

Item 5 on the Physical Functioning domain was omitted from the analyses because nearly everyone responded “never” to having problems with “taking a bath or shower by myself.” This very strong floor effect would make parameter estimation difficult. With the removal of this item, there were 7 items on the Physical Functioning domain and 5 items on each of the Emotional, Social, and School Functioning domains.
CCFA remains an area of active development, with the estimation methods performing well under some circumstances (eg, small number of items, large samples)10 and failing under other circumstances (eg, sparse data, complex models)11; so a number of estimation methods were used with 2 software packages (PRELIS12/LISREL13 and Mplus14). Researchers generally agree that using Pearson product moment correlations to fit factor analytic models with categorical variables induces bias in the parameters and fit statistics,15 so polychoric correlations among the items were obtained using listwise deletion to eliminate missing data. Weighted least squares (WLS) estimation is currently the statistically optimal method for CCFA because it provides appropriate statistical estimates with categorical data.16,17 However, WLS uses a weight matrix (the asymptotic covariance matrix) that must be inverted, which can create numerical problems when the sample size is small or the model has many measured variables.12 Alternative approaches include diagonally weighted least squares (DWLS, available in LISREL) or robust weighted least squares with a mean and variance correction (RWLSM/V, available in Mplus), which is WLS with a diagonal weight matrix. Thus, WLS and DWLS or RWLSM/V methods were used in each of the software packages. Unweighted least squares (ULS) estimation was
IRT in Health Outcomes Research
also used as a conservative, though least desirable, alternative. These estimation methods cover a range of weighting from full to partial to none. WLS is the ideal estimation method for CCFA because it has the potential to provide estimates and fit statistics that are closest to the truth. However, when the data are inappropriate for WLS (eg, small samples, complex models), WLS will produce unstable results. Under such conditions, DWLS, RWLSM/V, or ULS estimation will provide good approximations to the parameter estimates. Disadvantages to these methods include that fit statistics obtained with DWLS or RWLSM/V have unknown properties at this time, and ULS provides limited fit information. WLS should be used when possible, but these alternative estimation methods may be considered when necessary. We were fortunate to have an adequate sample size for the current application, which allowed us to use WLS throughout our analyses. In many cases, issues of sample size will prevent users from being able to use WLS, so we report ULS and DWLS here in an attempt to better understand how they perform relative to WLS. It is hoped that these comparisons will provide greater confidence in understanding ULS and DWLS (but especially DWLS) results when WLS becomes unfeasible.

One-factor solutions were obtained for the items on each of the 4 domains to assess the unidimensionality of each domain and the local independence of the items within each domain. Particular attention was given to the size of the factor loadings, fit statistics, and modification indices. In a desirable solution the items load substantially on the factor, the χ² statistic is nonsignificant, the root mean square error of approximation (RMSEA) is less than 0.05,18 the Non-Normed Fit Index (NNFI) is greater than 0.92,19 the root mean square residual (RMR) is near 0, and the modification indices for the error covariances are all of small magnitude. Consistency across the 8 forms of the domain is ideal.
The goal of small modification indices is based on the idea that a modification index is an estimate of the change that would be seen in the size of the χ² statistic if that constrained parameter were unconstrained. The size of a modification index should be considered in regard to the other modification indices (ie, is this modification index abnormally large compared with others in this model?) and in regard to the magnitude of the χ² statistic (ie, will model fit substantially improve if this parameter were allowed to vary?). It is important to note that perfect model fit is rarely achieved and, instead, the goal is to find a set of items that are “essentially” unidimensional.20 That is to say, a set of items that is defined by 1 dominant dimension may meet the requirement of unidimensional IRT models. It is less important for the model to be perfect than it is for it to be useful.

An additional 4-factor model was obtained for the entire set of 22 PedsQL 4.0 Generic Core Scales items (excluding item 5 on the Physical Functioning domain) in which the items on each domain loaded only on the factor corresponding to that domain, and the factors were correlated. This model could offer additional support for the assumption of unidimensionality if restricting the association of the items
to their respective domains does not result in a poorly fitting model. Large modification indices between items on different domains might be an indication of multidimensionality within domains, whereas large modification indices between items on the same domain might be an indication of local dependence. Once the dimensionality of the domains and local independence among the items had been considered, this information was used to inform IRT parameter estimation. Although many IRT models are available for use with polytomous data, Samejima’s GRM was chosen by the PROMIS psychometric team8 and was applied to these analyses. Maximum marginal likelihood estimation was used to fit the items on each of the domains using Multilog.21 Plots of the trace lines, the expected score, and the information curve were made for each item, facilitating comparisons across informant, age, and language. When local dependence was suspected, parameters for that domain were estimated with and without the problematic items, and comparisons were made. Support for local dependence was found when the item slopes were substantially different between the model that included the locally dependent items and the model in which the potential dependencies were removed.
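The fit criteria listed in the Methods (RMSEA less than 0.05, NNFI greater than 0.92) can be collected into a small screening helper, sketched below. The RMSEA expression used here, sqrt(max(χ² − df, 0) / (df (N − 1))), is a commonly used point estimate and is an assumption of this sketch rather than a formula taken from the article; some programs divide by N instead of N − 1. The function names and the sample size in the usage lines are hypothetical, since subsample sizes are not reported in this section.

```python
import math


def rmsea(chi_square, df, n):
    """Point estimate of RMSEA from a model chi-square (assumed formula)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


def screen_fit(rmsea_value, nnfi, rmsea_cut=0.05, nnfi_cut=0.92):
    """Compare fit indices with the cutoffs named in the Methods.

    Returns a dict of booleans; passing a cutoff is a heuristic, not proof
    that the scale is unidimensional.
    """
    return {
        "rmsea_ok": rmsea_value < rmsea_cut,
        "nnfi_ok": nnfi > nnfi_cut,
    }


if __name__ == "__main__":
    # Chi-square of 55 on 14 df with a hypothetical sample of 2,000.
    value = rmsea(55.0, 14, 2000)
    print(round(value, 3), screen_fit(value, nnfi=0.95))
```

Such a helper only summarizes the omnibus indices; the modification indices and the consistency of results across forms still need to be inspected separately, as the text describes.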
TABLE 1. Factor Loadings (Standard Errors) and χ² Statistics From LISREL v8.54 for 3 Estimation Methods Using Items From the Physical Functioning Domain With English Child Self-Report Data

                   ULS           DWLS          WLS
Item 1             0.79 (0.03)   0.80 (0.03)   0.81 (0.03)
Item 2             0.83 (0.02)   0.84 (0.02)   0.86 (0.02)
Item 3             0.87 (0.02)   0.87 (0.02)   0.89 (0.02)
Item 4             0.68 (0.03)   0.68 (0.03)   0.71 (0.03)
Item 6             0.59 (0.03)   0.59 (0.03)   0.60 (0.03)
Item 7             0.71 (0.03)   0.71 (0.02)   0.76 (0.02)
Item 8             0.71 (0.03)   0.71 (0.03)   0.77 (0.02)
C1 χ² (df = 14)    —             —             55
C2 χ² (df = 14)    243           253           —
C3 χ² (df = 14)    67            67            —
C4 χ² (df = 14)    55            55            —

Note: LISREL produces up to 4 different χ² values in the output. C1 is the minimum fit function χ² and is the only measure of fit provided if WLS is used. C2 is the normal theory weighted least squares χ², which assumes normality in the measured variable. With categorical data, under ULS or DWLS estimation, C2 does not follow a χ² distribution and should not be used. C3 is the Satorra–Bentler Scaled χ² and C4 is the χ² corrected for non-normality. Both are theoretically correct, the difference being that C3 does not require the inversion of the asymptotic covariance matrix.
RESULTS

Categorical Confirmatory Factor Analysis

Using 8 samples and 3 estimation methods (in 2 computer programs), we made a number of comparisons on each domain to assess the fit of the model and to gain confidence in the results. The results were compared across estimation methods to identify the method that produced the most consistent, reasonable results. Then the results obtained with the chosen method were used to assess IRT model assumptions.

The Physical Functioning domain presents a nice example of what CCFA results look like when the scale meets the assumptions of IRT. Results for ULS, DWLS, and WLS estimation methods for this domain (item 5 omitted) are compared in Table 1 (factor loadings, standard errors, and χ² statistics) and Table 2 (modification indices) using data from the English child self-report sample. Factor loadings are similar across estimation methods, though they increase slightly from ULS to DWLS (or WLSM/V) to WLS. It cannot be determined using empirical data if the ULS estimates are deflated or if the WLS estimates are inflated. The fit indices provide a relatively consistent picture of omnibus fit across the various estimation methods, although some caution must be taken in deciding which fit indices to consider (see note for Table 1). Here, χ² fit indices are presented because they are important in interpreting modification indices. However, as with this example, χ² statistics tend to indicate significance when the sample is large,22 and alternative fit indices should also be considered.

Modification indices were not entirely consistent across methods. Those obtained using ULS and DWLS (or RWLSM/V) are more erratic than those obtained using WLS (when comparing estimation methods across computer software). In fact, 2 of the modification indices obtained under ULS are impossibly large, as they are greater in magnitude than the omnibus χ². Theoretically, DWLS (or WLSM/V)
TABLE 2. Modification Indices (Rounded to the Nearest Whole Number to Facilitate Comparisons) for Various Estimation Methods Using Items From the Physical Functioning Domain With English Child Self-Report Data

Item pair             ULS    DWLS   WLS
Item 2 with item 1    346    6      0
Item 3 with item 1    2567   25     4
Item 3 with item 2    9      68     6
Item 4 with item 1    1      0      0
Item 4 with item 2    10     6      6
Item 4 with item 3    1      0      0
Item 6 with item 1    0      0      0
Item 6 with item 2    0      7      3
Item 6 with item 3    0      1      0
Item 6 with item 4    17     3      1
Item 7 with item 1    57     9      8
Item 7 with item 2    2      3      0
Item 7 with item 3    14     11     12
Item 7 with item 4    17     14     11
Item 7 with item 6    26     5      4
Item 8 with item 1    0      2      0
Item 8 with item 2    0      0      0
Item 8 with item 3    13     5      1
Item 8 with item 4    1      0      3
Item 8 with item 6    0      0      1
Item 8 with item 7    124    64     11

Note: WLS was chosen as the appropriate estimation method for these data, and no modification index stands out as particularly large. An example of a noteworthy modification index would be that between items 2 and 3 or between items 7 and 8 under DWLS, and had this estimation method been chosen, an alternative model with these error variances unconstrained would have been examined. However, both ULS and DWLS show modification indices that are larger than the χ² statistic for that method, and so there is some doubt as to the usefulness of these indices in these cases.
and WLS should provide accurate estimates and fit statistics in this situation. It is somewhat disturbing to see these differences in the modification indices provided under each method. These findings were consistent across domain and sample. As WLS is the gold standard estimator and seems to perform in a stable fashion, it was chosen as the method on which the evaluation of the domains is based.

The factor loadings for the Physical Functioning domain were similar across samples. The suggestion for evaluating RMSEA values is that values less than 0.05 indicate close fit, values less than 0.08 indicate reasonable fit, and values greater than 0.10 indicate an unfavorable model.18 For the Physical Functioning domain, the RMSEA values ranged from 0.05 to 0.10, indicative of an acceptable model. This was supported by NNFIs between 0.95 and 0.98 and RMRs between 0.06 and 0.13. Most modification indices were similar across forms. Items 7 (“I hurt or ache”) and 8 (“I have low energy”) had somewhat larger modification indices on the parent proxy-report forms than on the self-report forms (ie, these were between 10 and 94, whereas those for other items were between 0 and 20), indicating correlated error variances between these items for parents. These items were the only 2 that did not ask about activities that could be observed by a proxy reporter (eg, walking, lifting, doing chores), so it makes sense that these items are more related to each other in the proxy-report forms given that they are not externally observable behaviors. The modification indices were not sufficiently large to conclude that the items are locally dependent; therefore, the Physical Functioning domain was considered to be essentially unidimensional with no local dependence. Thus, the 7 items on the Physical Functioning domain could be considered together in IRT analyses.

The School Functioning domain is an example of CCFA results that raise questions about the validity of the IRT assumptions.
The factor loadings for the School Functioning domain, though not necessarily identical across samples, were large in magnitude for all forms. The modification indices were consistent across samples; items 4 (“I miss school because of not feeling well”) and 5 (“I miss school to go to the doctor or hospital”) showed very large modification indices on all 8 forms (ie, these were between 65 and 293, whereas those for others were between 0 and 231). In contrast to the other items that concern paying attention in class, forgetting things, and keeping up with schoolwork, these 2 items clearly have the concept of missing school in common. When these items were allowed to have correlated errors, the fit of the model improved substantially. For example, on the teen self-report form in Spanish, allowing these items to have correlated errors reduced the RMSEA from 0.16 to 0.02, the RMR from 0.14 to 0.02, and the largest modification index from 13 to 3, whereas this change increased the NNFI from 0.89 to 1.0. Because of the magnitude of the modification indices and the content of the items, items 4 and 5 were suspected of local dependence. Additionally, the modification indices for items 1 (“It is hard to pay attention in class”) and 3 (“I have trouble keeping up with my schoolwork”) also were elevated on the parent proxy-reports (eg, modification index of 231 mentioned above). This is likely because these
items concern behavior that the parent might not necessarily observe (ie, the parent is not at school with the child when these behaviors occur). Because the PROMIS project focuses on self-report measurement, the potential local dependence between these items for parent report data was noted but was not of primary concern in these analyses. Thus, IRT analyses could proceed on the 5 School Functioning items as long as the potential dependence between items 4 and 5 is further examined within an IRT model. The remaining domains, Emotional Functioning and Social Functioning, were also found to be unidimensional with no obvious local dependence and consistent across samples, and the planned IRT analyses seemed appropriate for the items within each domain. Finally, the parameters of a 4-factor model were estimated, and the results were consistent with those found in the unidimensional models. The factor loadings were large in magnitude and the modification indices agreed with those found in the unidimensional models. Modification indices for items loading on the factors for other domains were all small, suggesting that the model would not be substantially improved by allowing items on 1 domain to load on the factor associated with another domain. The factors were highly correlated (correlations ranging from 0.75 to 0.88). The unidimensionality of each domain was supported by the results of the 4-factor model, which showed reasonably good fit. For example, the child self-report sample in Spanish produced an RMSEA of 0.05, an NNFI of 0.95, and an RMR of 0.19.
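The comparison of modification indices with the omnibus χ² discussed for Table 2 can be sketched as a short script. The values below are transcribed from Table 2; the function name is hypothetical, and the reference χ² is left as a parameter because this section does not state which of the LISREL χ² values (C1 to C4 in Table 1) served as the reference for ULS and DWLS.

```python
# Item pairs in the order listed in Table 2 (Physical Functioning,
# English child self-report).
PAIRS = [
    (2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3), (6, 1),
    (6, 2), (6, 3), (6, 4), (7, 1), (7, 2), (7, 3), (7, 4),
    (7, 6), (8, 1), (8, 2), (8, 3), (8, 4), (8, 6), (8, 7),
]

# Modification indices transcribed from Table 2, one list per method.
MOD_INDICES = {
    "ULS": [346, 2567, 9, 1, 10, 1, 0, 0, 0, 17, 57, 2, 14, 17, 26,
            0, 0, 13, 1, 0, 124],
    "DWLS": [6, 25, 68, 0, 6, 0, 0, 7, 1, 3, 9, 3, 11, 14, 5,
             2, 0, 5, 0, 0, 64],
    "WLS": [0, 4, 6, 0, 6, 0, 0, 3, 0, 1, 8, 0, 12, 11, 4,
            0, 0, 1, 3, 1, 11],
}


def exceeding(method, reference_chi_square):
    """Item pairs whose modification index is larger than the reference
    chi-square, which should not happen for a well-behaved index."""
    return [
        (pair, mi)
        for pair, mi in zip(PAIRS, MOD_INDICES[method])
        if mi > reference_chi_square
    ]


if __name__ == "__main__":
    # WLS against its minimum fit function chi-square (C1 = 55 in Table 1).
    print("WLS:", exceeding("WLS", 55))
    # DWLS against the Satorra-Bentler scaled chi-square (C3 = 67); using
    # C3 here is an assumption of this sketch.
    print("DWLS:", exceeding("DWLS", 67))
```

Under WLS no index comes close to the omnibus χ², which agrees with the note to Table 2 that no WLS modification index stands out as particularly large.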
Item Response Theory

Based on the results of CCFA, the 4 domains were each modeled using a unidimensional IRT model. Because the possibility of local dependence between items 4 and 5 on the School Functioning domain was suggested by CCFA, this domain was modeled 3 times: with all 5 items, with item 4 removed, and with item 5 removed.

The Emotional Functioning domain is a case with reasonable item and test characteristics and consistency across test forms. For example, Figure 1 shows item 2 (“I feel sad or blue”) for both the self-report and proxy-report, child and teen forms in Spanish. The sets of trace lines are nearly coincident and the information obtained using this domain is similar across forms. The steepness of the trace lines (slope ranges from 2.4 to 3.9, depending on the sample) indicates that the item is highly discriminating for examinees who are somewhat above average in their symptom severity, and an item with trace lines of this magnitude would be a desirable addition to an item bank. However, this item would not provide much information for examinees with good quality of life (ie, those to the left of 0 on the scale) because the trace lines suggest that they would be very likely to respond “never” (thresholds range from −0.2 to 3.3 depending on the sample). In other words, a CAT should select this item for someone indicating above average severity, but this item would be less useful for someone indicating good quality of life. The parameters of the other items on the Emotional Functioning domain were consistent with item 2 (slopes range from 1.6 to 3.9, thresholds range from −0.9 to 4.0). All
FIGURE 1. Item 2 from the Emotional Functioning domain for Spanish child self-report (solid lines in left panel), Spanish child parent-proxy report (dashed lines in left panel), Spanish teen self-report (solid lines in right panel), and Spanish teen parent-proxy report (dashed lines in right panel).
5 items have desirable properties for the PROMIS item bank, though additional items should be written for measuring examinees with good quality of life. In contrast, the School Functioning domain contains 2 items with relatively poor characteristics. The results of CCFA suggested that item 4 (“I miss school because of not feeling well”) and item 5 (“I miss school to go to the doctor or hospital”) might be locally dependent. For this domain, the parameters of the IRT model were estimated in 3 ways, once with all 5 items (solid lines in Fig. 2) and then removing either item 4 or item 5 (dashed lines in Fig. 2). Often an indicator of local dependence is that when both items are included, their slopes may be high whereas those for the other items are lower.6 When either item from a locally dependent pair is removed, the slope of the item remaining from the pair decreases and the slopes of the other items increase. However, this pattern was not found because item 4 and item 5 are poorly related to the outcome of interest (slopes range from 0.7 to 1.2). Regardless of their inclusion or exclusion in the model, the parameter estimates for these items and the other items on the domain did not change. Despite their excess relationship, it did not damage the fit of the model to include both items 4 and 5 on the domain, though they do not add much information for scoring (bottom panels of Fig. 2). This is a case where violation of IRT assumptions, specifically multidimensionality, must be investigated but may be ignored if it turns out to have no impact on measurement. Still, because of the limited information available from the items, they would add little measurement value to an item bank and be a poor choice for administration on a CAT.
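The link between slope and information noted above can be illustrated with a short sketch of the GRM item information function. The formula used is the standard one for Samejima's model, the sum over categories of the squared derivative of each category trace line divided by that trace line; it is not reproduced in the article and is stated here as an assumption of the example. The slopes 0.7 and 2.4 are taken from the ranges reported in the text, while the thresholds and function names are hypothetical.

```python
import math


def cumulative_curves(theta, a, thresholds):
    """P(response >= j) for j = 0..m, with the boundary values 1 and 0."""
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def grm_information(theta, a, thresholds):
    """Item information for a graded response item at theta."""
    star = cumulative_curves(theta, a, thresholds)
    info = 0.0
    for j in range(len(thresholds) + 1):
        prob = star[j] - star[j + 1]
        # Derivative of the category trace line with respect to theta.
        slope = a * (star[j] * (1 - star[j]) - star[j + 1] * (1 - star[j + 1]))
        if prob > 0.0:
            info += slope * slope / prob
    return info


if __name__ == "__main__":
    thresholds = [0.0, 1.0, 2.0, 3.0]  # hypothetical values
    for theta in (-1.0, 0.0, 1.0, 2.0, 3.0):
        low = grm_information(theta, 0.7, thresholds)   # slope like School items 4 and 5
        high = grm_information(theta, 2.4, thresholds)  # slope like Emotional item 2
        print(theta, round(low, 3), round(high, 3))
```

With the same thresholds, the low-slope item yields a much lower and flatter information curve than the high-slope item, which is why such items would rarely be selected by a CAT.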
© 2007 Lippincott Williams & Wilkins
Medical Care • Volume 45, Number 5 Suppl 1, May 2007
FIGURE 2. Item 4 (left panel) and item 5 (right panel) from the School Functioning domain for English teen self-report when all items are calibrated together (solid lines). Dashed lines are item 4 calibrated with items 1–3 (left panel) and item 5 calibrated with items 1–3 (right panel).
School Functioning items 4 and 5 discriminate poorly among examinees on the latent construct. For example, an examinee who responds in the second category (“almost never”) is as likely to have good quality of life as she is to have poor quality of life. Because the trace lines are very flat, the information curve is also very low and flat, indicating that responses to these items provide little information about the examinee’s overall quality of life. These curves are based on a sample of primarily healthy children, and it is possible that missing school because of not feeling well or to go to the doctor/hospital would be more related to school functioning for children with chronic diseases. In that population, frequent school absence may be more detrimental to school performance than occasional absence is for healthy children. This possibility is examined in a study by Langer et al.23 It may be appropriate to calibrate all 5 items on the School Functioning domain together and include them in the PROMIS item bank, though items 4 and 5 would be unlikely selections for CAT administration. It is also possible that the properties of items 4 and 5 may change in a different population, and this should be considered when these items are calibrated using samples chosen for PROMIS.
Knowing that a pattern of inflated and deflated slopes may occur when locally dependent items are included in IRT parameter estimation led to a discovery of local dependence that was not apparent in the CCFA. Based on the CCFA, all 7 items included from the Physical Functioning domain were modeled using IRT. However, the slopes for items 1 (“It is hard for me to walk more than 1 block”), 2 (“It is hard for me to run”), and 3 (“It is hard for me to do sports activity or exercise”) were found to be surprisingly large as compared with those of the other items in the model (slope values from around 4 to 6 vs. near 2, depending on the sample). These items seem to be concerned with performing an exercise or activity, whereas the other items refer to lifting, doing chores, having aches, and having low energy. The curves for these items from the Spanish teen parent-proxy report are presented as solid lines in Figure 3, and it is apparent that the items are highly related to the construct being measured, especially items 2 and 3, which have very steep curves. The large
information values suggest that these items are very useful for measuring physical functioning. However, because they turn out to be locally dependent, they essentially turn the construct into a measure of exercise performance; their inclusion may narrow the scope of the scale. The evidence for this conclusion is shown as dashed lines in Figure 3, which are the curves for each of the 3 items as estimated with the other 2 items removed from the model. The difference between the solid lines and the dashed lines is striking; the curves become much flatter and the information function is greatly reduced when the other items involved in the local dependence are removed (slopes range between 2 and 3). This reflects the fact that each of the 3 items is weakly related to the remaining 4 items on the scale but the 3 items are strongly related to each other. When 2 of the 3 items are removed, the remaining scale measures a broader construct (physical functioning), whereas leaving in the triplet of items results in a scale that primarily measures exercise performance. If the slopes in the original model had not been carefully scrutinized, local dependence could have persisted in the scale, changing the construct being measured through IRT scoring. Although all 7 Physical Functioning items may be included together in an item bank, the parameters for items 1–3 would have to be calibrated using separate samples and linked to the same scale. Further, logic would have to be included in the CAT item selection algorithm to prohibit more than 1 of these 3 items from being administered to the same examinee.
FIGURE 3. Item 1 (left panel), item 2 (middle panel), and item 3 (right panel) from the Physical Functioning domain for Spanish teen parent-proxy report when all items are calibrated together (solid lines). Dashed lines are item 1 calibrated with items 4 and 6–8 (left panel), item 2 calibrated with items 4 and 6–8 (middle panel), and item 3 calibrated with items 4 and 6–8 (right panel).
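To make the link between slope size and information concrete, the following minimal sketch computes category trace lines and item information under Samejima's graded response model. The 5-category response format, the threshold values, and the slopes are assumptions chosen for illustration; they are not the estimates reported in this article.

```python
import numpy as np


def grm_category_probs(theta, a, b):
    """Category trace lines for a graded response model item.

    theta: latent trait values; a: slope; b: ordered thresholds (length K-1).
    Returns (category probabilities, boundary curves), where the boundary
    curves are P(X >= k) padded with 1 on the left and 0 on the right.
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    inner = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    cum = np.hstack([np.ones((theta.size, 1)), inner, np.zeros((theta.size, 1))])
    return cum[:, :-1] - cum[:, 1:], cum


def grm_item_information(theta, a, b):
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    probs, cum = grm_category_probs(theta, a, b)
    w = cum * (1.0 - cum)                 # boundary-curve derivative divided by a
    dprobs = a * (w[:, :-1] - w[:, 1:])   # derivative of each category curve
    return np.sum(dprobs ** 2 / np.clip(probs, 1e-12, None), axis=1)


theta_grid = np.linspace(-3.0, 3.0, 61)
thresholds = [-2.0, -1.0, 0.0, 1.0]       # hypothetical thresholds, 5 categories

for slope in (0.7, 2.5, 5.0):             # hypothetical slopes
    info = grm_item_information(theta_grid, slope, thresholds)
    print(f"slope={slope}: peak information={info.max():.2f}")
```

With the thresholds held fixed, the peak of the information curve grows with the slope, which is why an inflated slope produced by local dependence can make an item look more useful than it is for the intended construct.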
DISCUSSION
IRT can be a powerful tool in health outcomes assessment and will play a crucial role in PROMIS’s item banking and linear and adaptive test assembly.1 Assumptions of the IRT model must be checked before proceeding with item calibration. Without checking assumptions, the appropriateness of the model and the accuracy of the parameter estimates may be jeopardized.
This article provides ways that CCFA and IRT analysis can be used in questionnaire development. CCFA facilitates identification of potential locally dependent items and evaluates domain dimensionality. Different estimation methods and even different software packages may produce different results, so it is important that the user be aware of the appropriateness of the estimation method to the particular type of data and be familiar with the software in use. In this example, having 8 samples facilitated assumption checking because trends across samples could be observed. It is reassuring when the results across all samples agree, but confidence slips with mixed results. It is possible for a scale to be unidimensional in 1 sample and multidimensional in another, and for 2 items to be locally dependent in some but not all samples. Cases such as these would require separate item calibrations for different samples; therefore, it is best to check assumptions for all available samples before proceeding with IRT.
It has been our experience with CCFA that when sample sizes are large, WLS, DWLS, and ULS all obtain similar factor loadings. When the sample is inadequate for use with WLS, factor loadings tend to disagree between WLS and DWLS. Researchers who are uncertain about which method to utilize should try both, and use WLS when factor loadings agree and DWLS when factor loadings disagree. However, even with large samples, we have found fit indices and modification indices to be inconsistent across methods. When DWLS is applied, these indices should be used with caution.
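The rule of thumb for choosing between WLS and DWLS can be written as a small helper. This is a sketch only: the function name and the agreement tolerance `tol` are hypothetical, since no numeric criterion for "agreement" is given here, and the loadings are assumed to have been estimated beforehand in the user's CCFA software.

```python
import numpy as np


def choose_estimator(loadings_wls, loadings_dwls, tol=0.05):
    """Pick WLS when the two sets of factor loadings agree, DWLS otherwise.

    `tol` is a hypothetical threshold on the largest absolute difference
    between corresponding loadings. Returns (estimator name, largest gap).
    """
    wls = np.asarray(loadings_wls, dtype=float)
    dwls = np.asarray(loadings_dwls, dtype=float)
    if wls.shape != dwls.shape:
        raise ValueError("loading vectors must have the same shape")
    max_gap = float(np.max(np.abs(wls - dwls)))
    return ("WLS" if max_gap <= tol else "DWLS"), max_gap


# Invented loadings for a 5-item, 1-factor model.
agree = choose_estimator([0.80, 0.75, 0.71, 0.52, 0.49], [0.81, 0.74, 0.70, 0.53, 0.50])
disagree = choose_estimator([0.80, 0.55, 0.71, 0.30, 0.49], [0.70, 0.74, 0.70, 0.53, 0.50])
print(agree[0], disagree[0])
```

The helper compares loadings only; as noted above, fit indices and modification indices may still differ across methods and need separate scrutiny.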
IRT provides evidence about the information from each item response for inferences about the respondent’s standing
on the construct being measured. Item parameters can be used to inform scoring rules and to determine for which populations the scale is most appropriate. Item parameters are key in selecting items for administration to a particular examinee on a CAT such as PROMIS. Again, in our example, multiple samples facilitated a better understanding of the nature of the items. When discrepancies are found between item parameters for different samples, often the item content and the sample characteristics explain the results. Careful examination of the item parameters can also identify locally dependent items that are not revealed in CCFA. Items with very high slopes can dominate the domain, in which case local dependence must be eliminated for the scale to take on its intended meaning. Graphics facilitate item evaluation because it is easy to inspect the curves to identify items that provide substantial information in the range of the latent construct that the test is designed to measure. When locally dependent items are identified, 1 or both items do not necessarily have to be excluded from the item bank. Often, 1 item will be informative for examinees in 1 area of the latent trait, and the other will be similarly informative in another region. Such items must be calibrated separately so that they do not take over the latent trait of the scale, and the parameters can be subsequently linked. CAT programmers would then need to specify that these items should not be administered to the same examinee. In this example, the domains were found to be essentially unidimensional with some evidence of local dependence between some items. It is important that this local dependence be flagged before item calibration is finalized so that adjustments can be made to ensure that the slope parameters are on the metric of the construct of interest. 
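The restriction that locally dependent items not be administered to the same examinee could be handled in a CAT with an exclusion rule at the item selection step. The sketch below assumes maximum-information selection and a hypothetical representation of locally dependent groups as sets of item identifiers; the item names and information values are invented, and an operational CAT would involve additional considerations such as exposure control and stopping rules.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set


def select_next_item(info_at_theta: Dict[str, float],
                     administered: Set[str],
                     ld_groups: Iterable[FrozenSet[str]]) -> Optional[str]:
    """Return the most informative eligible item, or None if none remain.

    An item is ineligible if it was already administered or if it belongs
    to a locally dependent group from which an item was already given.
    """
    blocked = set(administered)
    for group in ld_groups:
        if group & administered:   # a member of this group was already used
            blocked |= group
    candidates = {item: info for item, info in info_at_theta.items()
                  if item not in blocked}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Invented information values at the current trait estimate.
info_now = {"PF1": 3.0, "PF2": 6.0, "PF3": 5.0, "PF4": 1.5}
ld_groups = [frozenset({"PF1", "PF2", "PF3"})]

first = select_next_item(info_now, set(), ld_groups)
second = select_next_item(info_now, {first}, ld_groups)
print(first, second)
```

Once one member of the group has been administered, the remaining members are skipped even though their information values are the largest in the pool.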
All items seemed appropriate for inclusion in the PROMIS item bank, though some (eg, those with large slopes) are more likely to be selected by the CAT algorithm than others. The threshold parameters for these items generally fell in the moderate to poor range of the quality-of-life continuum, indicating that these items provide the most information for examinees with moderate to poor quality of life. This knowledge can be used to write PROMIS items that are designed to fill in the measurement gaps along the continuum of the latent traits. With the recent interest in the use of CAT in health outcomes research, CCFA and IRT will soon become staples in the health outcomes researcher’s statistical tool bag. Certainly, for the PROMIS project, CCFA and IRT will be invaluable for evaluating candidate items and assembling quality patient-reported outcomes instruments. We hope that this example emphasizes that these tools are powerful when used correctly, but can produce unintended results when model assumptions are not verified.
ACKNOWLEDGMENTS
This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research. Information on this RFA (Dynamic Assessment of Patient-Reported Chronic Disease Outcomes) can be found at http://nihroadmap.nih.gov/clinicalresearch/overview-dynamicoutcomes.asp.
REFERENCES
1. Cella D, Yount S, Rothrock N, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): progress of an NIH Roadmap Cooperative Group during its first two years. Med Care. 2007;45(Suppl 1):S3–S11.
2. Hambleton RK. Emergence of item response modeling in instrument development and data analysis. Med Care. 2000;38:II-60–II-65.
3. Birnbaum A. Some latent trait models and their use in inferring an examinee’s ability. In: Lord FM, Novick MR, eds. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley; 1968:395–479.
4. Samejima F. Estimation of Latent Ability Using a Response Pattern of Graded Scores. Iowa City, IA: Psychometric Society; 1969. Psychometric Monograph No. 17.
5. Samejima F. Graded response model. In: van der Linden WJ, Hambleton RK, eds. Handbook of Modern Item Response Theory. New York, NY: Springer Verlag; 1997:85–100.
6. Chen W, Thissen D. Local dependence indexes for item pairs using item response theory. J Educ Behav Stat. 1997;22:265–289.
7. Varni JW, Seid M, Kurtin PS. The PedsQL™ 4.0: reliability and validity of the Pediatric Quality of Life Inventory™ Version 4.0 Generic Core Scales in healthy and patient populations. Med Care. 2001;39:800–812.
8. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45(Suppl 1):S22–S31.
9. Varni JW, Burwinkle TM, Seid M, et al. The PedsQL™ 4.0 as a pediatric population health measure: feasibility, reliability, and validity. Ambul Pediatr. 2003;3:329–341.
10. Oranje A. Comparison of estimation methods in factor analysis with categorized variables: applications to NAEP data. Paper presented at Annual Meeting of the American Educational Research Association, Chicago, IL; April 2003.
11. Flora DB, Curran PJ. An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychol Methods. 2004;9:466–491.
12. Jöreskog KG, Sörbom D. PRELIS 2 User’s Reference Guide: A Program for Multivariate Data Screening and Data Summarization; A Preprocessor for LISREL. Chicago, IL: Scientific Software International; 1996.
13. Jöreskog KG, Sörbom D. LISREL 8: User’s Reference Guide. Chicago, IL: Scientific Software International; 1996.
14. Muthén LK, Muthén BO. Mplus User’s Guide. 3rd ed. Los Angeles, CA: Muthén & Muthén; 1998–2004.
15. Jöreskog KG. New developments in LISREL: analysis of ordinal variables using polychoric correlations and weighted least squares. Qual Quantity. 1990;24:387–404.
16. Browne MW. Asymptotically distribution-free methods for the analysis of covariance structures. Br J Math Stat Psychol. 1984;37:62–83.
17. Muthén B. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika. 1984;49:115–132.
18. Browne MW, Cudeck R. Alternative ways of assessing model fit. Sociol Methods Res. 1992;21:230–258.
19. Tucker LR, Lewis C. A reliability coefficient for maximum likelihood factor analysis. Psychometrika. 1973;38:1–10.
20. McDonald RP. Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum Associates; 1999.
21. Thissen D, Chen W-H, Bock RD. Multilog (Version 7) [Computer Software]. Lincolnwood, IL: Scientific Software International; 2003.
22. Schumacker RE, Lomax RG. A Beginner’s Guide to Structural Equation Modeling. Mahwah, NJ: Erlbaum; 1996.
23. Langer MM, Hill CD, Thissen D, et al. Detection and evaluation of differential item functioning using item response theory: an application to the Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 Generic Core Scales. J Clin Epidemiol. In press.