ITEM FIT IN DISCRETE DATA


Item diagnostics in multivariate discrete data

Alberto Maydeu-Olivares
University of Barcelona

Yang Liu
University of North Carolina

Author note

This research was supported by an ICREA-Academia Award and grant SGR 2009 74 from the Catalan Government and grants PSI2009-07726 and PR2010-0252 from the Spanish Ministry of Education. Correspondence concerning this article should be addressed to Alberto Maydeu-Olivares, Faculty of Psychology, University of Barcelona, P. Vall d’Hebron, 171, 08035 Barcelona. E-mail: [email protected].


Abstract

Researchers who evaluate the fit of psychometric models to binary or multinomial items often look at univariate and bivariate residuals to determine how a poorly fitting model can be improved. There is a class of z statistics and also a class of generalized chi-square statistics that can be used for examining these marginal fits. We describe these statistics and compare them with regard to the control of Type I error and statistical power. We show how the class of z statistics can be extended to accommodate items with multinomial response options. We provide guidelines for the use of these statistics, including how to control for multiple testing, and present two detailed examples that illustrate how to use these methods to improve the fit of widely used item response theory models.

Keywords: Goodness of fit, categorical data, IRT, local dependencies, mixture



Item diagnostics in multivariate discrete data

Psychological applications often involve fitting psychometric models to discrete data. A typical example is the construction of an instrument to measure a psychological construct. After, say, an item response theory (IRT) model has been fitted to data gathered using the instrument, a researcher examines the overall fit of the model. When the overall fit is deemed satisfactory, she may wish to examine how well the model fits each item, for some items may not be well fitted by the model even though the overall (i.e., average) fit is satisfactory. Alternatively, when the overall fit is considered unsatisfactory, the researcher will want to locate and understand the nature of the misfit. To identify misfitting items and to gain insights into how to modify the model, it is often of interest to examine the observed vs. expected endorsement rates of individual items, or the observed vs. expected association of an item with each of the other items in the model.

There is a substantial amount of literature on this topic in multivariate discrete data, primarily in the IRT literature (e.g., Cagnone & Mignani, 2007; Chen & Thissen, 1997; Glas & Verhelst, 1995; Glas & Suarez-Falcón, 2003; McKinley & Mills, 1985; Orlando & Thissen, 2000, 2003; Reise, 1990; Tay & Drasgow, 2011; Toribio & Albert, 2011). However, an investigator wishing to perform such an assessment is faced with an array of statistics to choose from; some of them have unknown sampling distributions, others appear to be valid only for certain models, and yet others appear to be valid only for detecting certain types of misfit. In this article we describe two methods that enable researchers to examine how well a multivariate model for categorical data fits individual items and the associations between one item and others: z statistics, and a class of statistics related to Pearson’s $X^2$ statistic.
The methods described have known asymptotic sampling distributions, they can be applied to any model, and they are all-purpose in the sense that they are not designed to detect any specific type of misfit. We now provide a hypothetical example to introduce the topic.

A hypothetical example: Assessing item exchangeability

Suppose an investigator has three items that she believes will be endorsed at the same rate as each other. In other words, she believes the items are exchangeable. Suppose further that 1,000 persons are administered these items, yielding the response frequencies for each pattern shown in Table 1. If the investigator fits a model that is consistent with the assumption that the items are exchangeable, what evidence can be examined to judge whether the model fits? When the items are exchangeable, every pattern with the same sum score (the same number of items endorsed) is equally likely. This means that all patterns with the same sum score should have similar observed frequencies. This is not clearly the case in this example. Thus, it appears that the exchangeable model does not fit well.

We can quantify the discrepancy between the observed and expected frequencies by computing an overall goodness of fit statistic. To do so, we first need to compute the expected frequency of each response pattern. Table 1 includes the probabilities for all response patterns assuming that items are exchangeable. Under an exchangeable model, the probabilities depend on a single parameter, $\pi$, the probability of endorsing an item, which is assumed to be common to all items. We use maximum likelihood to estimate this parameter, obtaining $\hat{\pi}_{ML} = 0.413$ and the expected frequencies shown in Table 1. The two most widely used test statistics for assessing the fit of a categorical data model are the likelihood ratio statistic, $G^2$, and Pearson’s statistic, $X^2$. Both test statistics have the same asymptotic distribution when the fitted model holds (Agresti, 2002). They follow a chi-square distribution with six degrees of freedom in this example. The estimated statistics are $X^2 = 12.60$, $p = 0.05$ and $G^2 = 12.99$, $p = 0.04$.
Thus, using a significance level of 5% the model barely fits.
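
The computations described in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern counts are hypothetical (they are not the frequencies in Table 1), the exchangeable model is taken to be three independent items sharing one endorsement probability, as the single-parameter description suggests, and the function names are made up for this sketch. The closed-form tail probability used here is valid only for even degrees of freedom.

```python
import math
from itertools import product

# Hypothetical counts for the 8 response patterns of three binary items
# (illustrative only; NOT the frequencies reported in Table 1).
counts = {
    (0, 0, 0): 230, (0, 0, 1): 110, (0, 1, 0): 115, (1, 0, 0): 150,
    (0, 1, 1): 75, (1, 0, 1): 105, (1, 1, 0): 110, (1, 1, 1): 105,
}
N = sum(counts.values())
n_items = 3

# ML estimate of the common endorsement probability:
# the average of the item endorsement proportions.
pi_hat = sum(sum(c) * f for c, f in counts.items()) / (n_items * N)

def pattern_prob(pattern, pi):
    """Pattern probability assuming independent items with common pi."""
    s = sum(pattern)
    return pi ** s * (1 - pi) ** (len(pattern) - s)

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("closed form implemented for even df only")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

expected = {c: N * pattern_prob(c, pi_hat)
            for c in product((0, 1), repeat=n_items)}

# Pearson's X^2 and the likelihood ratio statistic G^2.
X2 = sum((counts[c] - e) ** 2 / e for c, e in expected.items())
G2 = 2 * sum(counts[c] * math.log(counts[c] / e) for c, e in expected.items())

df = len(counts) - 1 - 1  # C - q - 1 with C = 8 patterns and q = 1 parameter
print(f"pi_hat={pi_hat:.3f}  X2={X2:.2f} (p={chi2_sf_even_df(X2, df):.3f})  "
      f"G2={G2:.2f} (p={chi2_sf_even_df(G2, df):.3f})")
```

As a check on the tail-probability helper, `chi2_sf_even_df(12.60, 6)` is about 0.05 and `chi2_sf_even_df(12.99, 6)` is about 0.04, in line with the p-values quoted for the example.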



Suppose now that, confronted with this evidence, our investigator wishes to obtain a better fitting model. To do so, she must identify where the model misfits. This can be done by comparing the observed and expected frequencies for each pattern. Looking at Table 1 we see that this is quite difficult to do in this example, because of the sampling variability of the observed frequencies. To account for this sampling variability our investigator can compute a z statistic for each response pattern by taking the difference between the observed and expected frequency for that pattern and dividing it by its standard error. These z statistics for cell residuals¹ follow, asymptotically, a standard normal distribution when the model holds (Agresti, 2002). They are also shown in Table 1, where we see that there is only one statistically significant residual at the 5% level, that of pattern (0, 1, 1). Not surprisingly, our investigator does not find this information very helpful in suggesting ways to modify her model to obtain a better fit to the data.

Insert Table 1 about here

Our investigator considers applying Pearson’s $X^2$ to each item separately (to assess each item’s fit) as well as applying $X^2$ to each pair of items (to assess the existence of residual associations between pairs of items not accounted for by the model, very much as one inspects residual correlations in factor analysis) because the results of such a piecewise analysis are much easier to interpret than the z statistics for cell residuals. However, she is unsure how to compute the degrees of freedom and, more generally, whether the procedure is justified. We discuss this approach in the next section. We point out that, for a piecewise assessment of this kind, a generalized version of Pearson’s $X^2$ needs to be used in lieu of $X^2$ itself.
Depending on the piecewise assessment of interest, there may not be enough degrees of freedom for testing using a generalized $X^2$ statistic. This problem can be overcome by using z statistics instead. Z statistics are discussed in section three. For binary data, Reiser (1996) showed how one can assess the fit of the model to an item by computing a z statistic, and how to use a z statistic to assess the fit of the model to a pair of items. In this paper we extend this procedure to the case of polytomous ordinal items. In the polytomous case, researchers may be interested in an even more detailed analysis: identifying the categories where the misfit is located (e.g., is it the lowest category?). This can also be investigated by using suitable z statistics.

Necessarily, the use of these piecewise diagnostics leads to a multiple testing problem. The best solution to this problem involves the use of an overall goodness of fit statistic. $X^2$ or $G^2$ cannot be used when the overall table is sparse, and in section four we briefly discuss alternatives to these test statistics that are better suited to the sparse tables encountered in applications. In section five we report the results of a small simulation study to examine the empirical Type I error of the generalized $X^2$ and z statistics. The fitted model is Samejima’s (1969) IRT model with a single latent trait (and its special case, the two-parameter logistic model). We also investigate the empirical power of these statistics to detect multidimensionality and the presence of mixtures when this model is fitted. We conclude by providing some guidelines on the use of these statistics, and two detailed examples. The article is intended to be non-technical. Technical details are provided in an Appendix.

Generalized chi-square statistics

Let n be the number of items, K be the number of response alternatives² per item, and N be the sample size. The observed responses to the items can be gathered in a contingency table with $C = K^n$ cells, where each cell c is one of the possible response patterns. To assess the overall fit of the model we can use Pearson’s statistic

$$X^2 = N \sum_{c=1}^{C} \frac{(p_c - \hat{\pi}_c)^2}{\hat{\pi}_c}, \tag{1}$$

where $p_c$ is the observed proportion of responses to pattern c and $\hat{\pi}_c$ is the expected probability according to the model being tested. For our example, these proportions and estimated probabilities multiplied by sample size are given in columns 2 and 4 in Table 1. For asymptotically optimal (i.e., minimum variance) estimators such as the maximum likelihood (ML) estimator, $X^2$ asymptotically follows a chi-square distribution with $C - q - 1$ degrees of freedom, where q is the number of parameters to be estimated. However, when the estimator is not optimal, Pearson’s $X^2$ does not follow this distribution³. For instance, consider the model and data presented in Table 1. The single parameter of this model can be estimated by minimizing the unweighted least squares function between the observed cell proportions and theoretical probabilities, $\sum_{c=1}^{C} (p_c - \pi_c)^2$, yielding $\hat{\pi}_{ULS} = 0.412$. Because the least squares estimator is not optimal, a statistic other than $X^2$ must be used.

One test statistic that can be applied in conjunction with non-optimal estimators is $M_n$ (Maydeu-Olivares & Joe, 2005). $M_n$ is asymptotically chi-square with $C - q - 1$ degrees of freedom for any consistent estimator (such as the unweighted least squares estimator). The $M_n$ statistic is most easily presented in matrix form. In matrix form, $X^2$ can be written as

$$X^2 = N (\mathbf{p} - \hat{\boldsymbol{\pi}})' \hat{\mathbf{D}}^{-1} (\mathbf{p} - \hat{\boldsymbol{\pi}}), \tag{2}$$

where $\boldsymbol{\pi}$ and $\mathbf{p}$ denote vectors of length C containing the probabilities and proportions respectively, and $\mathbf{D}$ is a diagonal matrix with the probabilities along the diagonal. $M_n$, on the other hand, can be written as

$$M_n = X^2 - N (\mathbf{p} - \hat{\boldsymbol{\pi}})' \hat{\mathbf{U}} (\mathbf{p} - \hat{\boldsymbol{\pi}}), \qquad \mathbf{U} = \mathbf{D}^{-1} \boldsymbol{\Delta} \left( \boldsymbol{\Delta}' \mathbf{D}^{-1} \boldsymbol{\Delta} \right)^{-1} \boldsymbol{\Delta}' \mathbf{D}^{-1}, \tag{3}$$

where $\boldsymbol{\Delta}$ is a $C \times q$ matrix that contains the derivative of each response pattern probability with respect to each of the model parameters. Therefore, $M_n$ can be seen as a generalization of $X^2$ that incorporates a term to account for the possible non-optimality of the estimators. When the ML estimator is used, the second term in the expression for $M_n$ equals zero (Maydeu-Olivares & Joe, 2005), and $M_n$ and $X^2$ are algebraically equal.
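
The relation between $X^2$ and $M_n$ can be checked numerically for a one-parameter model, where the matrix products in the correction term reduce to scalars. The sketch below is illustrative only: the counts are hypothetical, the exchangeable model is assumed to be three independent items with a common endorsement probability, and the function names are invented for this example.

```python
from itertools import product

# Hypothetical pattern counts for three binary items (not the data in Table 1).
counts = {
    (0, 0, 0): 230, (0, 0, 1): 110, (0, 1, 0): 115, (1, 0, 0): 150,
    (0, 1, 1): 75, (1, 0, 1): 105, (1, 1, 0): 110, (1, 1, 1): 105,
}
patterns = list(product((0, 1), repeat=3))
N = sum(counts.values())
p = [counts[c] / N for c in patterns]  # observed pattern proportions

def probs_and_derivs(pi):
    """Pattern probabilities and their derivatives with respect to pi."""
    probs, derivs = [], []
    for c in patterns:
        s, n = sum(c), len(c)
        probs.append(pi ** s * (1 - pi) ** (n - s))
        d = 0.0  # d/dpi of pi^s (1 - pi)^(n - s), by the product rule
        if s > 0:
            d += s * pi ** (s - 1) * (1 - pi) ** (n - s)
        if n - s > 0:
            d -= (n - s) * pi ** s * (1 - pi) ** (n - s - 1)
        derivs.append(d)
    return probs, derivs

def x2_and_mn(pi):
    """Return (X^2, M_n) evaluated at the estimate pi, for q = 1."""
    probs, derivs = probs_and_derivs(pi)
    resid = [pc - qc for pc, qc in zip(p, probs)]
    x2 = N * sum(e * e / qc for e, qc in zip(resid, probs))
    # With a single parameter, Delta' D^-1 Delta and Delta' D^-1 (p - pi)
    # are scalars, so the correction term needs no matrix inversion.
    a = sum(d * d / qc for d, qc in zip(derivs, probs))
    b = sum(d * e / qc for d, e, qc in zip(derivs, resid, probs))
    return x2, x2 - N * b * b / a

pi_ml = sum(sum(c) * counts[c] for c in patterns) / (3 * N)
x2_ml, mn_ml = x2_and_mn(pi_ml)               # ML estimate: correction vanishes
x2_other, mn_other = x2_and_mn(pi_ml + 0.02)  # deliberately non-optimal estimate
print(x2_ml, mn_ml, x2_other, mn_other)
```

At the ML estimate the two statistics coincide up to floating-point error, whereas at the perturbed estimate $M_n$ is smaller than $X^2$ because the correction term is positive.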



Testing using non-optimal parameter estimates is of interest when assessing the fit of the model to the margins of the table. By this we mean estimating the model parameters using all the items and investigating how well the model reproduces the responses to a single item (a univariate margin), or to a pair of items (a bivariate margin). More formally, a univariate margin is the marginal frequency distribution of an item obtained by summing over the appropriate response patterns; a bivariate margin is the marginal distribution of two items obtained by summing over the appropriate response patterns⁴. Consider now the assessment of the fit of a model to a univariate or a bivariate margin using Pearson’s $X^2$. For an item (a univariate margin) we write

$$X_i^2 = N \sum_{l=0}^{K-1} \frac{(p_{il} - \hat{\pi}_{il})^2}{\hat{\pi}_{il}}, \tag{4}$$

where we have coded the item’s categories using 0, 1, …, $K - 1$, and $p_{il}$ and $\hat{\pi}_{il}$ denote the observed proportion and the expected probability for category l of item i. For a pair of items (a bivariate margin), the summation is over all pairs of categories:

$$X_{ij}^2 = N \sum_{l=0}^{K-1} \sum_{m=0}^{K-1} \frac{(p_{il,jm} - \hat{\pi}_{il,jm})^2}{\hat{\pi}_{il,jm}}. \tag{5}$$

Just like $X^2$, $M_n$ can be applied to single items or to pairs of items, in which case we write $M_i$ and $M_{ij}$ respectively. $M_i$ and $M_{ij}$ follow an asymptotic chi-square distribution with $K - q_i - 1$ and $K^2 - q_{ij} - 1$ degrees of freedom, where $q_i$ and $q_{ij}$ denote the number of parameters involved in item i, and in items i and j respectively. Consider the exchangeable items example in Table 1. Because the items are binary and there is only one parameter being estimated, there are $2 - 1 - 1 = 0$ degrees of freedom for testing using $M_i$. We cannot assess how well the model reproduces the marginal frequencies for each item separately using $M_i$ because we do not have enough degrees of freedom. As shown in the Appendix, when degrees of freedom equal zero, the M statistics equal zero. For $M_{ij}$ we have $4 - 1 - 1 = 2$ degrees of freedom, and so in this case we can test how well the model reproduces the marginal frequencies of each pair of items.
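
As an illustration of what “summing over the appropriate response patterns” means, the following sketch builds univariate and bivariate margins from a table of pattern probabilities and evaluates the degrees-of-freedom formulas above. The helper names are hypothetical, and the pattern probabilities assume independent items with the common probability 0.413 quoted for the example.

```python
from itertools import product

def univariate_margin(pattern_probs, i, K):
    """Marginal distribution of item i: sum pattern probabilities by category."""
    margin = [0.0] * K
    for pattern, prob in pattern_probs.items():
        margin[pattern[i]] += prob
    return margin

def bivariate_margin(pattern_probs, i, j, K):
    """K x K marginal table for items i and j."""
    table = [[0.0] * K for _ in range(K)]
    for pattern, prob in pattern_probs.items():
        table[pattern[i]][pattern[j]] += prob
    return table

def df_item(K, q_i):
    """Degrees of freedom of M_i: K - q_i - 1."""
    return K - q_i - 1

def df_pair(K, q_ij):
    """Degrees of freedom of M_ij: K^2 - q_ij - 1."""
    return K ** 2 - q_ij - 1

# Exchangeable model for three binary items, using the ML estimate from the text.
pi = 0.413
probs = {c: pi ** sum(c) * (1 - pi) ** (3 - sum(c))
         for c in product((0, 1), repeat=3)}

m1 = univariate_margin(probs, 0, K=2)     # [Pr(Y1 = 0), Pr(Y1 = 1)]
m12 = bivariate_margin(probs, 0, 1, K=2)  # 2 x 2 table for items 1 and 2

print(m1, m12)
print(df_item(2, 1), df_pair(2, 1))  # 0 and 2, as in the exchangeable example
```

The same functions apply to polytomous items by passing a larger K and patterns coded 0, 1, …, K − 1.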



In contrast, $X_i^2$ only follows an asymptotic chi-square distribution when the parameters are optimally estimated for the univariate margin of item i, and $X_{ij}^2$ only when the parameters are optimally estimated for the bivariate margin involving items i and j. To illustrate this point, we return to our exchangeable items example. When estimating the single parameter of this model in an n item test, the MLE is $\hat{\pi}_{ML} = (p_1 + p_2 + \cdots + p_n)/n$. On the other hand, the MLE when estimated using a single item is $\hat{\pi}_i = p_i$, and it is $\hat{\pi}_{ij} = (p_i + p_j)/2$ when estimated using two items. $X_i^2$ follows an asymptotic chi-square distribution if $\hat{\pi}_i$ is used, since this is an optimal estimator for the univariate margin. Similarly, $X_{ij}^2$ is asymptotically chi-square if $\hat{\pi}_{ij}$ is used. Further, if $\hat{\pi}_i$ is used, $X_i^2 = M_i$, and if $\hat{\pi}_{ij}$ is used, $X_{ij}^2 = M_{ij}$. But when $\hat{\pi}_{ML}$ (the estimator obtained from all the data) is used, $X_i^2$ and $X_{ij}^2$ do not follow a chi-square distribution⁵. Furthermore, in this case $X_i^2 \geq M_i$ and $X_{ij}^2 \geq M_{ij}$ (Maydeu-Olivares & Joe, 2006).

Indeed, the estimated $X_i^2$ in this example are 7.16, 1.39, and 2.24, whereas for all items $M_i = 0$ (because degrees of freedom are zero in this case). The $X_{ij}^2$ and $M_{ij}$ statistics obtained using $\hat{\pi}_{ML}$ are presented in Table 2. We see in this table that $X_{ij}^2 \geq M_{ij}$, which means that the p-values for $X_{ij}^2$ obtained using a chi-square distribution are too small, and they suggest that the model fits more poorly than it actually does.

Insert Table 2 about here

The results obtained using $M_{ij}$ are more helpful than the z statistics for response patterns reported in Table 1 for locating the misfit of the exchangeable model in this example. The results in Table 2 suggest that item 1 is not well fitted by the model because $M_{ij}$ for pairs (1,2) and (1,3) are statistically significant. A more direct approach for detecting whether the model misfits item 1 would be to use a test statistic for item 1, which is not possible using $M_i$ because of lack of degrees of freedom. For some models there may be no degrees of freedom available for testing using $M_{ij}$ either. The problem of lack of degrees of freedom for testing the fit of items and pairs of items after a multivariate model has been fitted may be avoided altogether by applying z statistics to margins.

Using z statistics for piecewise model diagnostics

$z_i$ and $z_{ij}$ statistics for binary data

Consider a model for multivariate binary data, such as the exchangeable items model. For each item we have two probabilities that add up to one. Therefore, to test how well the model fits an item we need to focus on only one of them. Arbitrarily, we select the probability of endorsing the item. A z statistic for assessing the fit of the model to a binary item can then be obtained as the residual proportion of respondents that endorse the item divided by its standard error:

$$z_i = \frac{p_i - \hat{\pi}_i}{SE(p_i - \hat{\pi}_i)}. \tag{6}$$

In (6), $p_i$ is the observed proportion of respondents that endorse item i, and $\hat{\pi}_i$ is the expected probability of endorsing the item, which depends on the estimated item parameters. Because the residual proportions are asymptotically normally distributed if the model is correctly specified, the resulting z statistic, by construction, asymptotically follows a standard normal distribution. Details on how to compute these standard errors are provided in the Appendix.

Consider now computing a z statistic for assessing how well the model fits a pair of binary items i and j. There are four cells in a $2 \times 2$ table whose probabilities add up to one. Thus, we only need to consider three cells for testing. But if we test the fit of the model to each item using the univariate statistics $z_i$ and $z_j$ we only need to consider one additional cell.



Let $\hat{\pi}_{ij}$ be the expected probability of endorsing both items and let $p_{ij}$ be the observed proportion; then, a z statistic for testing a pair of binary items is

$$z_{ij} = \frac{p_{ij} - \hat{\pi}_{ij}}{SE(p_{ij} - \hat{\pi}_{ij})}. \tag{7}$$

In Table 2 we present the $z_i$ and $z_{ij}$ statistics for the exchangeable items model. There are six z statistics in this table. Using a Bonferroni adjustment to control the overall Type I error at 5% leads to a two-sided critical value of 2.64. Only the z statistic for item 1 is larger than this critical value. The z statistics suggest that item 1 is not fitted well by the exchangeable items model. This is the same conclusion reached previously using $M_{ij}$, in that item 1 was involved in both misfitting item pairs: (1,2) and (1,3).

From an applied perspective, the use of the $z_i$ and $z_{ij}$ statistics proposed by Reiser (1996) is preferable to the use of $M_i$ and $M_{ij}$ statistics because they allow us to test the fit of models for binary data for which there may not be degrees of freedom for testing using $M_i$ and $M_{ij}$. For polytomous items, the lack of degrees of freedom may still be an issue, particularly for testing the fit of single items. Below we propose an extension of Reiser’s statistics to items that take more than two categories.

$z_i$ and $z_{ij}$ statistics for polytomous ordinal data

When the binary data are coded as 1 (endorse) and 0 (not endorse), the proportion and probability of respondents that endorse an item are the sample and the population mean respectively. Similarly, the proportion and probability of respondents that endorse items i and j are the sample and population cross-product between the items. For polytomous data coded as 0, 1, …, $K - 1$, the population mean of item i is

$$\kappa_i = 0 \times \Pr(Y_i = 0) + \cdots + (K_i - 1) \times \Pr(Y_i = K_i - 1), \tag{8}$$

where $Y_i$ denotes the random variable associated with item i, and the probabilities of each category depend on the model parameters. Using $k_i$ to denote the sample mean, $k_i = \sum y_i / N$, a z statistic for a polytomous variable can be computed as the residual mean divided by its standard error

$$z_i = \frac{k_i - \hat{\kappa}_i}{SE(k_i - \hat{\kappa}_i)}. \tag{9}$$

Similarly, a z statistic for a pair of polytomous variables can be computed as

$$z_{ij} = \frac{k_{ij} - \hat{\kappa}_{ij}}{SE(k_{ij} - \hat{\kappa}_{ij})}, \tag{10}$$

where $k_{ij} = \sum y_i y_j / N$ is the sample cross-product, and

$$\kappa_{ij} = 0 \times 0 \times \Pr(Y_i = 0, Y_j = 0) + \cdots + (K_i - 1) \times (K_j - 1) \times \Pr(Y_i = K_i - 1, Y_j = K_j - 1) \tag{11}$$

is the population cross-product, which depends on the model parameters. When data are binary, (9) and (10) reduce to (6) and (7) respectively. For polytomous items with unordered categories, there is no point in computing these statistics, as the assignment of numerical labels to categories is arbitrary. Even when the item categories are ordered these statistics should be interpreted cautiously, simply as devices that may or may not be useful for detecting the source of misfit in models for polytomous ordered data.

z statistics for specific categories of polytomous items

When a model fails to fit a polytomous item (or a pair of polytomous items) well, we may be interested in identifying the particular categories within the item (or pairs of categories when a bivariate statistic is employed) that are responsible for the misfit. In particular, when the items are ordinal we may be interested in knowing whether the misfit is located in the low, middle, or high category. This can be accomplished by computing a z statistic for each category within an item, and for each pair of categories within a pair of items. These z statistics can be computed for polytomous unordered categories as well. Let $\pi_{il} = \Pr(Y_i = l)$ denote the probability that item i takes category l. Also, let $\pi_{il,jm} = \Pr(Y_i = l, Y_j = m)$ denote the probability that items i and j take categories l and m, respectively. We use $p_{il}$ and $p_{il,jm}$ to denote their corresponding sample proportions. Then, the z statistics for a univariate cell residual and for a bivariate cell residual are

$$z_{il} = \frac{p_{il} - \hat{\pi}_{il}}{SE(p_{il} - \hat{\pi}_{il})}, \qquad z_{il,jm} = \frac{p_{il,jm} - \hat{\pi}_{il,jm}}{SE(p_{il,jm} - \hat{\pi}_{il,jm})}, \tag{12}$$

respectively. The z statistics (12) should be distinguished from the standardized residuals (Agresti, 2002)

$$e_{il} = \frac{\sqrt{N}\,(p_{il} - \hat{\pi}_{il})}{\sqrt{\hat{\pi}_{il}}}, \qquad e_{il,jm} = \frac{\sqrt{N}\,(p_{il,jm} - \hat{\pi}_{il,jm})}{\sqrt{\hat{\pi}_{il,jm}}}, \tag{13}$$

that is, the components whose squares add up to $X_i^2$ and $X_{ij}^2$. For instance, consider the bivariate marginal table for items 1 and 2 in our exchangeable items example. The four standardized residuals (13) are {0.22, 1.44, -2.41, 1.46}, and the sum of their squares equals, up to rounding, the $X^2$ statistic for that bivariate marginal table, 10.08. In contrast, the z statistics (12) are {-0.39, 1.67, -2.78, 2.06}. The standardized residuals are highly correlated with the z statistics (above 0.99 in this example). However, their distribution is not standard normal. Therefore, the standardized residuals provide information about which pairs of cells show the highest misfit, but do not provide information about whether the observed misfit is statistically significant.

Controlling the overall Type I error

Necessarily, assessing the fit of every item and every pair of items after a model has been fitted to the contingency table of n items leads to a multiple testing problem. A straightforward way to address this issue is by applying a Bonferroni correction to the nominal significance level. This is analogous to using all possible t-tests with Bonferroni corrections to assess the existence of mean differences among three or more treatments. A more powerful method is the Benjamini-Hochberg method (Benjamini & Hochberg, 1995; Thissen, Steinberg & Kuang, 2002) and we use this method in our examples later on. However, the optimal approach to the multiple testing problem in ANOVA involves using the F-ratio to test the overall null hypothesis of no mean differences. Only when this F-ratio suggests statistically significant mean differences should t-tests for pairwise differences be examined (with a Bonferroni, Benjamini-Hochberg, or some other type of correction). Analogously, when examining the fit of a model for multivariate discrete data, an overall goodness of fit test must be performed first. If the model is rejected, a piecewise fit assessment using the methods described in this paper should be performed. In contrast to ANOVA, but as recommended in factor analysis, we suggest that a piecewise fit assessment be performed even when the model cannot be rejected, since the piecewise assessment may reveal that some parts of the model are not fitted well.

The likelihood ratio test, $G^2$, and Pearson’s statistic, $X^2$, may be used to assess the model’s overall fit. However, the asymptotic p-values for $X^2$ and $G^2$ are accurate only when the expected frequencies for all response patterns are large (> 5 is the usual rule of thumb). A practical way to evaluate whether the asymptotic p-values for $X^2$ and $G^2$ are accurate is to compare them. If the p-values are similar, then both p-values are likely to be correct. If they are very different, both p-values are most likely incorrect. If they are slightly dissimilar, then $X^2$ yields the most accurate p-value (Koehler & Larntz, 1980). Unfortunately, as the number of possible response patterns increases, their expected frequencies must decrease because the sum of all probabilities must be equal to one (Bartholomew & Tzamourani, 1999). As a result, in multivariate discrete data analysis the asymptotic p-values for the overall $X^2$ and $G^2$ cannot usually be trusted. In fact, when the number of categories per item is large (say > 4) the asymptotic p-values almost invariably become inaccurate as soon as n > 5. How inaccurate? In models with a large number of response patterns⁶ (say > 1000), $G^2$ often yields a p-value of 1, and $X^2$ a p-value of 0.
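
The two multiple-testing corrections named at the start of this section, Bonferroni and Benjamini-Hochberg, can be sketched as follows. The z values are hypothetical and the function names are illustrative; only the Python standard library is used.

```python
from statistics import NormalDist

def bonferroni_critical_z(alpha, n_tests):
    """Two-sided critical value for |z| under a Bonferroni correction."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_tests))

def two_sided_p(z):
    """Two-sided p-value of a z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # Find the largest rank k such that p_(k) <= k * alpha / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            max_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical z statistics for six tests (illustration only).
z_values = [2.9, -0.4, 1.1, 2.3, 2.5, -0.8]
p_values = [two_sided_p(z) for z in z_values]

crit = bonferroni_critical_z(0.05, len(z_values))
bonf = [abs(z) > crit for z in z_values]
bh = benjamini_hochberg(p_values, alpha=0.05)

print(round(crit, 3))
print(bonf)
print(bh)
```

With these hypothetical values the Bonferroni rule flags only the largest statistic, while the Benjamini-Hochberg rule also flags the statistics equal to 2.3 and 2.5, which illustrates its greater power.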
In recent years, a number of statistics have been proposed that overcome the problem of low expected counts when testing the overall goodness of fit of a multivariate model for discrete data (often referred to in the literature as “data sparseness”). In this article’s examples we use Maydeu-Olivares and Joe’s (2005, 2006) overall statistic, $M_2$. This statistic may be used with binary or polytomous ordered or unordered data. It only uses univariate and bivariate information for testing and it follows an asymptotic chi-square distribution for any consistent estimator. Its asymptotic p-values are accurate even in very large models and small samples. When all variables are binary and item parameters have been estimated by ML, alternative statistics may be used. Reiser (1996) proposed an alternative statistic which differs from $M_2$ in the weight matrix employed and also follows an asymptotic chi-square distribution. Cai, Maydeu-Olivares, Coffman and Thissen (2006) considered an alternative statistic first introduced by Bartholomew and Leung (2002) and describe how to obtain p-values for it using mean and variance corrections similar to those introduced by Satorra and Bentler (1994) in the context of structural equation modeling. Cagnone and Mignani (2007) introduced an extension of Reiser’s statistic for polytomous items. Further details on these overall test statistics are provided in the Appendix.

For our exchangeable items example $M_2 = 12.59$ on 5 df, $p = 0.03$. The $M_2$ statistic belongs to the M family of test statistics, which includes $M_1, M_2, \ldots, M_n$. In $M_1$ only univariate information is used, in $M_2$ univariate and bivariate information is used, and so on up to $M_n$, where all information available in the data is used. For maximum power over most alternatives, Maydeu-Olivares and Joe (2005, 2006) recommend testing at the smallest order of information at which the model is identified. Most IRT models are identified (i.e., they can be estimated) using bivariate information. But the exchangeable items model can be estimated using only univariate information (the MLE for this model equals the average of the proportion of respondents endorsing each item). Thus, for this model we can compute $M_1 = 10.79$ on 2 df, $p = 0.005$. Recall that for this example $M_3 \equiv M_n = X^2$ (because we use the MLE) $= 12.60$ on 6 df, $p = 0.05$. Why do the p-values become smaller as we decrease the order of information we use for testing? Because within this family of statistics the power of the statistics increases as we decrease the order of information (Joe & Maydeu-Olivares, 2010). This is most easily seen by computing the RMSEA associated with each of these statistics (Maydeu-Olivares & Joe, 2014): $\mathrm{RMSEA}_3 = 0.033$ (the RMSEA associated with the overall $X^2$), $\mathrm{RMSEA}_2 = 0.039$, $\mathrm{RMSEA}_1 = 0.066$. The RMSEAs increase as the order of information decreases because the RMSEAs are a function of the non-centrality parameter per degree of freedom of the statistic. Note that, to avoid confusion, we use numeric subscripts to refer to statistics that assess the overall fit of the model ($M_1$, $M_2$, …), whereas we use $M_i$ and $M_{ij}$ to denote statistics that assess the fit of the model to single items, item pairs, etc.

Empirical behavior of $X_{ij}^2$, $M_{ij}$ and z statistics for piecewise assessment

We report the performance of the statistics described using a small simulation study. One purpose of the study is to investigate the extent to which p-values for $X_{ij}^2$ depart from the nominal rates under a reference chi-square distribution. Another aim is to investigate whether the asymptotic p-values of the $z_i$, $z_{ij}$ and $M_{ij}$ statistics, as well as the z statistics for specific categories, are accurate in small samples. For statistics with accurate empirical Type I errors, we also examine their power to reject some model misspecifications that may be of interest in applications.

The fitted model in all cases is Samejima’s graded logistic IRT model with two and four categories per item (i.e., K = 2 or 4). For binary data, this model reduces to the two-parameter logistic (2PL) model. We assume a model with a single latent trait (i.e., a unidimensional model) and we also assume that the latent trait follows a standard normal distribution. In the two-parameter logistic model with the items coded as 0 or 1, the item response function is

$$\Pr(Y_i = 1 \mid \eta) = \Psi(\alpha_i + \beta_i \eta) = \frac{1}{1 + \exp[-(\alpha_i + \beta_i \eta)]}, \tag{14}$$

with $\Pr(Y_i = 0 \mid \eta) = 1 - \Pr(Y_i = 1 \mid \eta)$. In (14), $\eta$ denotes the latent trait, $\alpha_i$ denotes the intercept for item i and $\beta_i$ denotes the slope. The model can be extended to model ordered responses like those obtained from rating items. More specifically, a two-parameter model is used to model the probability of endorsing categories 1 or higher, 2 or higher, etc. That is, in this model the option response function is specified as

$$\Pr(Y_i = k \mid \eta) = \begin{cases} 1 - \Psi(\alpha_{i,1} + \beta_i \eta) & \text{if } k = 0, \\ \Psi(\alpha_{i,k} + \beta_i \eta) - \Psi(\alpha_{i,k+1} + \beta_i \eta) & \text{if } 0 < k < K - 1, \\ \Psi(\alpha_{i,K-1} + \beta_i \eta) & \text{if } k = K - 1. \end{cases} \tag{15}$$
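
The option response function can be sketched in Python as differences of cumulative logistic probabilities. This is a minimal illustration: the function names are made up, the intercepts are assumed to be supplied in decreasing order, and the parameter values are those reported for the first item in the simulation study.

```python
import math

def logistic(x):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def graded_category_probs(eta, intercepts, slope):
    """Category probabilities Pr(Y = k | eta) for k = 0, ..., K - 1.

    `intercepts` holds the K - 1 intercepts in decreasing order, so that
    logistic(intercepts[k - 1] + slope * eta) is Pr(Y >= k | eta).
    """
    # Pad with Pr(Y >= 0) = 1 and Pr(Y >= K) = 0, then take differences.
    cum = [1.0] + [logistic(a + slope * eta) for a in intercepts] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Four-category item (K = 4) evaluated at the mean of the latent trait.
probs_4 = graded_category_probs(0.0, [1.418, 0.0, -1.418], 2.269)

# With a single intercept the model reduces to the 2PL model.
probs_2 = graded_category_probs(0.0, [-1.418], 2.269)

print([round(v, 3) for v in probs_4])
print([round(v, 3) for v in probs_2])
```

Writing the model through the padded cumulative probabilities keeps the three cases of the piecewise definition in a single expression and makes it evident that the category probabilities sum to one.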

The item parameters were estimated using maximum likelihood⁷ (Bock & Aitkin, 1981) and the asymptotic covariance matrix of the item parameter estimates (needed to compute the z statistics; see the Appendix) was computed using the expected information matrix. Two sample sizes were used: N = 100 and 1000. In all cases, 1000 replications per condition were used. Nine items were used when K = 2, and six when K = 4. Fewer items were used in the polytomous case to reduce the number of response patterns, as the expected information matrix involved in our computation of the z statistics involves all possible response patterns. With nine items and K = 2 there are $2^9 = 512$ patterns, but for K = 4 there are $4^9 = 262{,}144$ patterns. Reducing the number of items to only six leads to $4^6 = 4{,}096$ patterns. We used nine items instead of six for K = 2 so that the contingency table would be sufficiently sparse.

Correct model specification results: Empirical Type I errors

For K = 2, data were generated using $\boldsymbol{\beta}' = (2.269, 1.668, 1.276)$ and $\boldsymbol{\alpha}' = (-1.418, 0, 1.064)$, repeated thrice. For K = 4, data were generated using the model above with the following parameter values: $\boldsymbol{\beta}' = (2.269, 1.668, 1.276)$ and $\boldsymbol{\alpha} = ((1.418, 0, -1.418), (1.191, 0, -1.191), (1.064, 0, -1.064))$, repeated twice. These parameter values are equivalent to factor loadings $\boldsymbol{\lambda}' = (0.8, 0.7, 0.6)$ and thresholds $\boldsymbol{\tau}' = (0.5, 0, -0.5)$ in an ordinal factor analysis parameterization (Takane & de Leeuw, 1987; Flora & Curran, 2004). For K = 4, the values used are equivalent to using $\boldsymbol{\tau}_i' = (-0.5, 0, 0.5)$ for all items. The binary parameter values used are typical in personality assessments applied to general populations⁸. The threshold values used for K = 4 lead to items with a higher probability of endorsing the extremes than the middle categories, as found in applications displaying somewhat extreme responding.

Empirical rejection rates for $z_i$ at the 5% significance level ranged between 0.05 and 0.07 even in the smallest sample considered, N = 100, regardless of the number of categories per item. Due to the lack of degrees of freedom, M statistics cannot be used to assess the fit of the model to single items; nor can they be used to assess the fit to pairs of binary items⁹. Results for bivariate statistics ($z_{ij}$ and $M_{ij}$) are reported in Table 3. For conciseness, only the results for pairs (1, 2), (1, 3), and (2, 3) are presented, which cover all three slope combinations. For K = 2, only $z_{ij}$ statistics are shown in this table. For polytomous data, the asymptotic distribution for $M_{ij}$ is chi-square with seven degrees of freedom¹⁰. This table also shows Type I error rates for $X_{ij}^2$ using this reference distribution. In this table we see that the rejection rates for $M_{ij}$ and $z_{ij}$ are right on target even at N = 100. In contrast, the empirical rejection rates for $X_{ij}^2$ are too large. The use of $X_{ij}^2$ to assess the fit of Samejima’s model to pairs of items will result in the incorrect rejection of some well-fitting item pairs.
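
The stated equivalence between the logistic parameters and the ordinal factor analysis parameterization can be checked numerically. The sketch below assumes the usual scaling constant 1.702 between the logistic and normal ogive metrics and the sign convention that the intercept is minus the scaled threshold; both are assumptions of this illustration (they reproduce the generating values for K = 4 to three decimals), and the function name is hypothetical.

```python
import math
from statistics import NormalDist

D = 1.702  # assumed logistic-to-normal scaling constant

def to_irt(loading, thresholds):
    """Convert a factor loading and thresholds to a slope and intercepts."""
    scale = D / math.sqrt(1.0 - loading ** 2)
    slope = scale * loading
    intercepts = [-scale * t for t in thresholds]
    return slope, intercepts

thresholds = [-0.5, 0.0, 0.5]
converted = {lam: to_irt(lam, thresholds) for lam in (0.8, 0.7, 0.6)}
for lam, (slope, intercepts) in converted.items():
    print(lam, round(slope, 3), [round(a, 3) for a in intercepts])

# Marginal category probabilities implied by the thresholds when the
# underlying response variable is standard normal.
cdf = NormalDist().cdf
cuts = [0.0] + [cdf(t) for t in thresholds] + [1.0]
marginal = [cuts[k + 1] - cuts[k] for k in range(len(cuts) - 1)]
print([round(v, 3) for v in marginal])
```

Under these assumptions the two extreme categories each have a marginal probability of about 0.31 and the two middle categories about 0.19, consistent with the description of items whose extremes are endorsed more often than the middle categories.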
Insert Table 3 about here When a misfit is found in a polytomous item or in a pair of polytomous items, it may be of interest to identify the item category or pairs of item categories in which the misfit is


19

located. The Type I error rates in the simulations were right on target: their median was .05 for both sample sizes with values that ranged from .04 to .07. Detailed results are provided in the supplementary materials to this article, which can be downloaded from . Incorrect model specification results: Empirical power

The IRT model fitted in the previous subsection makes five basic assumptions: a) responses to items are conditionally independent given a single latent trait, b) the trait is normally distributed, c) the item response function is obtained by taking differences of ‘block’ functions, d) the block functions have a cumulative logistic form, and e) responses arise from a single population. Any of these assumptions may be incorrect in applications. For instance, assumption a) will be violated if there are two or more latent traits underlying the responses. Assumption b) will be violated if the latent trait is skewed or multimodal (e.g., Woods, 2006). Assumption c) will be violated if the response process involves taking the ratio of a ‘block’ function over the sum of all blocks, as in Muraki’s (1992) generalized partial credit model (see Thissen & Steinberg, 1986). Assumption d) will be violated if the block functions are not cumulative logistic. For instance, if an ideal point process underlies the responses, then a normal density function would be more appropriate (Maydeu-Olivares, Hernández & McDonald, 2006). Finally, assumption e) will be violated if the sample comes from a mixture of different populations of responses (Bolt, Cohen & Wollack, 2001).

For this study we selected one type of misspecification for which we expected high power (multidimensionality) and another for which we expected low power (mixtures). More specifically, we considered three misspecified models: a) a mixture of two one-dimensional models whose latent trait means differ by one standard deviation; b) a mixture of two one-dimensional models whose latent trait means differ by three standard deviations; and c) a multidimensional independent clusters model (i.e., each item is an indicator of a single trait, but there are multiple traits, each with three indicators) where the latent traits correlate 0.3. The mixture data were generated using the same parameter values as in the correct specification condition. Fifty per cent of the sample was generated using a standard normal distribution; the other 50% was generated using a normal distribution with a standard deviation of 1 and a mean of 1 (or 3). The multidimensional data were generated using an independent clusters configuration with the same sequence of slopes for each dimension and the same values used in the correctly specified condition. Trial runs revealed that power for the alternatives under consideration is not high, and therefore results are only shown for N = 1000. We did not compute the power for $X^2_{ij}$ due to the lack of a reference asymptotic distribution.

Empirical rejection rates at α = 0.05 for $z_i$ statistics (i.e., for testing one item at a time) are reported in Table 4. In this table we see that these statistics have little power to detect the presence of multidimensionality (empirical rejection rates are not larger than the Type I error rates). Also, there is little power to detect the presence of mixing when the latent trait means differ by one standard deviation, but power is high at this sample size to detect the presence of mixtures when the separation is three standard deviations. In this case, power is higher in the polytomous case than in the binary case.

Empirical rejection rates of $z_{ij}$ and $M_{ij}$ (i.e., statistics for testing the fit to pairs of items) are also reported in Table 4. As we can see in this table, for binary data and a sample size of 1000 respondents, the bivariate $z_{ij}$ statistic has maximum power to detect the presence of this type of multidimensionality when both items belong to the same dimension.
Power is lower when the items belong to different dimensions; it also decreases as the average slope decreases. In the polytomous case, power is lower than in the binary case when both items belong to the same dimension, but it is higher than in the binary case when the items belong to different dimensions. In contrast, the $M_{ij}$ statistics show very low power to detect multidimensionality, even at this sample size: empirical power equals the size of the test except when both items belong to the same dimension and their slopes are high.

Empirical power to detect the presence of mixtures equals the size of the test when the separation between the latent trait means is small (one standard deviation), regardless of whether $z_{ij}$ or $M_{ij}$ statistics are used, but there is power to detect the model misfit when the mean separation is large (three standard deviations). In this case, the $z_{ij}$ statistics are more powerful than the $M_{ij}$ statistics. Also, power is higher the larger the average slope, and the power of $z_{ij}$ statistics is larger in the polytomous case than in the binary case.

Empirical rejection rates of z statistics for each specific category and for each specific pair of categories are shown in Table 5 when the variables are polytomous. The power of the univariate statistics is low (roughly equal to the size of the test) to detect the presence of multidimensionality or a mixture with a one standard deviation mean difference. Bivariate statistics also show power roughly equal to the size of the test to detect the latter model. The bivariate statistics have power to detect the presence of multidimensionality, and power increases slightly as the average slope increases. Also, power is highest when both items take the highest or lowest categories, e.g., for category pairs (0, 0) or (3, 3), and lowest when the two response categories are adjacent, e.g., (0, 1) and (2, 3).
The bivariate statistics exhibit highest power to detect the presence of a mixture with a three standard deviation mean difference when both items take the most extreme category, and the power increases as the average slope increases; the univariate statistics for categories 1 and 3 show very high power (> 0.9) to detect this type of misfit, for category 0 the statistics show moderate power, and for category 2 they show very low power (< 0.1). Because the power of z statistics for specific categories is far from uniform across categories, researchers should be careful when interpreting these statistics. In applications, significant z statistics for specific categories may reflect differences in fit across categories, but they also may reflect differences in power.

Insert Tables 4 and 5 about here

Discussion

In our simulation studies the empirical rejection rates of the z and $M_{ij}$ statistics match the expected rates under correct model specification even when the sample size is only 100 observations.¹¹ In contrast, $X^2_{ij}$ over-rejects the model when the reference distribution is a chi-square with degrees of freedom equal to the number of cells minus the number of parameters involved in the marginal table minus one (i.e., the reference distribution of $M_{ij}$). The extent to which empirical rejection rates of Pearson’s statistics applied to marginal tables differ from nominal rates under this reference distribution will vary from model to model and from estimator to estimator. The extent of the over-rejection in our simulations is not large: 0.09 at the α = 0.05 level. In some instances, rejection rates for Pearson’s statistics will be larger than those reported here. Liu and Maydeu-Olivares (2013) applied these statistics to test a correctly specified 2PL model using triplets of variables. $M_{ijk}$ follows¹² a chi-square distribution with one degree of freedom when testing the 2PL. Using this reference distribution leads to empirical rejection rates for $X^2_{ijk}$ above 0.20 at the α = 0.05 level. But for some models and some estimators, $X^2_{ij}$ may yield empirical Type I errors close to the nominal rates. This appears to be the case when an ordinal factor analysis is estimated using unweighted least squares from polychoric correlations (Muthén, 1993). Simulation results for this model and estimation method are reported in the supplementary materials to this article, which are available for download.

Overall $M_2$ statistics were computed for all the misspecified models considered previously. Degrees of freedom are 27 when K = 2, and 315 when K = 4. When testing the three-dimensional model, the empirical rejection rate at α = 0.05 was 1 for both K = 2 and K = 4. When testing mixtures, the empirical rejection rates were 0.04 and 0.01 for K = 2 and K = 4, respectively, with a one standard deviation separation, and 0.22 and 0.67 with a three standard deviation separation. Comparing these results with those in Table 4, we notice that when an overall test is performed more power is obtained to reject the multidimensional model than if a piecewise assessment is performed, but less power is obtained to reject the mixture alternatives. These results suggest that in applications the overall tests may be more powerful than tests for individual items or for pairs of items. Hence it is possible for a model to be rejected by the overall test statistic while the piecewise diagnostic statistics lack power to identify the source of the misfit.

When a misfit is found in an item (or pair of items), researchers may be interested in locating the categories within the item (or bivariate cells if a pair of items is being considered) in which the misfit is located. This can be investigated using suitable z statistics (the residual proportion of interest divided by its standard error). Our results reveal that these z statistics also have adequate empirical Type I errors. However, our results also reveal that the power of these z statistics is not uniform within a univariate table (or bivariate table). This means that in applications, the z statistics may suggest, for instance, that the lowest category is poorly fit for all items. However, this result may reflect differences in power across categories, not necessarily differences in fit; therefore, researchers should use these statistics cautiously.

Guidelines for applied users

We recommend that an overall goodness-of-fit test be performed first, to avoid rejecting a well-fitting model. Whenever the number of possible response patterns is small, this can be performed using the well-known $X^2$ and $G^2$ statistics. If the number of possible response patterns is large, and the asymptotic p-values of $X^2$ and $G^2$ differ widely, we recommend applying an overall goodness-of-fit statistic that only makes use of the low order margins of the contingency table, such as $M_2$, as asymptotic p-values for these statistics can be trusted in situations of data sparseness.

A piecewise fit assessment should be performed next, to reveal the source of misfit in the case of poorly fitting models, and to identify parts of the data that are not well fitted by models that closely fit the rest of the data. This piecewise fit assessment can be readily performed using $z_i$ statistics for single items and $z_{ij}$ statistics for pairs of items, provided the items are binary or polytomous ordinal. Our simulation results reveal that these statistics have adequate empirical Type I errors, as good as those of $M_{ij}$, and that they are in general more powerful. The univariate and bivariate z statistics may be presented in matrix form with univariate statistics along the diagonal. It is necessary to control for multiple testing when inspecting the results of these z statistics. The easiest way to do so is by performing a Bonferroni adjustment. A more powerful method is the Benjamini-Hochberg method (Benjamini & Hochberg, 1995; Thissen, Steinberg & Kuang, 2002) used in this paper.

When the items are polytomous and their categories are unordered, there is no point in computing the $z_i$ and $z_{ij}$ statistics (9) and (10). In this case, $M_{ij}$ statistics for pairs of items can be computed, provided there are degrees of freedom for testing. The $M_{ij}$ statistics can be presented in matrix form to facilitate the interpretation of the results. To this end, we have also found it helpful to inspect the average of the $M_{ij}$ statistics involving each item.
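The two multiple-testing adjustments mentioned above can be sketched as follows. The z values are hypothetical and serve only to show that the Benjamini-Hochberg step-up rule can flag more statistics than a Bonferroni adjustment at the same nominal level; the function names are illustrative and are not those of the R code used in this paper.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a z statistic under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni(pvalues, alpha=0.05):
    """Flag p-values that are no larger than alpha divided by the number of tests."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up rule: find the largest rank k with p_(k) <= alpha * k / m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            k_max = rank
    flagged = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            flagged[idx] = True
    return flagged


# Hypothetical z statistics for ten items or item pairs (not taken from any table).
z_values = [3.1, -2.6, 2.5, 2.3, 0.4, -1.0, 0.2, 1.2, -0.7, 0.1]
p_values = [two_sided_p(z) for z in z_values]

bh_flags = benjamini_hochberg(p_values)
bonf_flags = bonferroni(p_values)
print("Benjamini-Hochberg flags:", sum(bh_flags))
print("Bonferroni flags:", sum(bonf_flags))
```

In this illustration the fourth statistic (z = 2.3) would be significant without any adjustment but is flagged by neither procedure.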
For every item (or pair of items) flagged as misfitting using the above procedures, researchers may wish to inspect the z statistics for each of its categories (or for each pair of categories), as these statistics may reveal the location of the misfit more specifically. Generally we do not do this for binary items and, in any case, it is recommended that the Benjamini-Hochberg method be used to control for multiple testing. We now apply these guidelines to two examples.

Numerical examples

In the first application, we examine the fit of the two-parameter logistic model to the short form of the Extraversion (E) scale of Eysenck’s Personality Questionnaire-Revised (EPQ-R: Eysenck, Eysenck, & Barrett, 1985). In the second application, we examine the fit of Samejima’s logistic graded model to the Positive Problem Orientation (PPO) scale of the Social Problem-Solving Inventory-Revised (SPSI-R: D'Zurilla, Nezu & Maydeu-Olivares, 2002). In both cases, a one-dimensional model is called for according to substantive theory. The item parameters are estimated by maximum likelihood using IRTPRO (Cai, Thissen & du Toit, 2011) and saved to an external file. We use these estimates to compute the $M_{ij}$, $X^2_{ij}$ and z statistics using R (R Development Core Team, 2012). This R code is available for download.

Short form E scale of the EPQ-R

The female UK normative data for this scale, kindly provided by Paul Barrett and Sybil Eysenck, are analyzed. The sample size is 824. The scale consists of 12 binary items. A typical item is ‘Are you rather lively?’. The response categories are ‘Yes’ and ‘No’, coded as 1 and 0 respectively. Items negatively related to extraversion were reverse coded prior to the analysis. There are $2^{12}$ = 4,096 possible response patterns. These data are sparse, and Pearson’s $X^2$ and the likelihood ratio $G^2$ statistic yield conflicting conclusions, both on 4,071 degrees of freedom: $X^2$ = 44,440.03, p = 0 and $G^2$ = 1,363.49, p = 1. Neither asymptotic p-value is accurate. An accurate p-value for the overall fit of the model can be obtained using Maydeu-Olivares and Joe’s (2005, 2006) $M_2$ statistic reported by IRTPRO. We obtained $M_2$ = 484.2 on 54 degrees of freedom. A 90% confidence interval for an RMSEA based on $M_2$ is (0.09; 0.11). Maydeu-Olivares and Joe (2014) suggested that values of this RMSEA$_2$ below 0.05 indicate good fit, whereas values below 0.089 represent adequate fit. Since we obtain p(RMSEA$_2$ < 0.089) = 0.026, we conclude that a one-dimensional 2PL model fits these data poorly.

Next, we attempt to locate the source of misfit using $z_i$ statistics (6) to assess the fit of each item separately and $z_{ij}$ statistics (7) to assess the fit of each pair of items separately. The bivariate statistics are displayed in Table 6 in matrix form, with univariate statistics along the diagonal. The statistics that are significant at the 5% level using the Benjamini-Hochberg method are boldfaced. We see some rather large bivariate statistics in this table and we do not observe any pattern. Four of them are larger than 6: for items (8,10), (1,7), (3,9), and (2,12), $z_{ij}$ = 11.82, 7.17, 6.52 and 6.5, respectively. All four $z_{ij}$ statistics are positive, reflecting that the observed proportion of respondents endorsing both items in the pair is larger than expected under the model. These large statistics are most likely caused by similarities in the item stems that are not accounted for by the model. For instance, items 8 and 10 are ‘Can you easily get some life into a rather dull party?’ and ‘Can you get a party going?’. Items 1 and 7 are ‘Are you a talkative person?’ and ‘Are you mostly quiet when you are with other people?’. This kind of dependency is referred to as a ‘doublet factor’ in the factor analysis literature, and it can be accommodated by incorporating correlated residuals into the model. It is not possible to include correlated residuals within the logistic model fitted here, but a bifactor model can be used to achieve the same end result. More specifically, we used a bifactor model with one primary dimension and four secondary dimensions, one for each dependency. Each secondary dimension only has two non-zero slopes; both slopes are fixed at one,¹³ and the variance of the latent trait is estimated. Each secondary dimension constrained in this fashion is analogous to a correlated residual in factor analysis.
The model provides a close fit to the data: it yields $M_2$ = 157.45 on 50 df, and the 90% confidence interval for RMSEA$_2$ is (0.04; 0.06).

Insert Table 6 about here
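As a rough check on the reported intervals, a point estimate of the RMSEA based on $M_2$ can be computed from the statistic, its degrees of freedom, and the sample size. The formula below is an assumption (the common sample estimate, the square root of max(0, ($M_2$ − df) / (N × df))); it is not given in this section, some authors divide by N − 1 instead of N, and the function name is illustrative. Under this assumption the point estimates for both models fitted to the E scale fall inside the reported 90% confidence intervals.

```python
import math


def rmsea2_point_estimate(m2, df, n):
    """Assumed sample estimate of the RMSEA based on M2 (not taken from the paper)."""
    return math.sqrt(max((m2 - df) / (n * df), 0.0))


# Values reported for the E scale: one-dimensional 2PL and bifactor model, N = 824.
rmsea_2pl = rmsea2_point_estimate(484.2, 54, 824)
rmsea_bifactor = rmsea2_point_estimate(157.45, 50, 824)

print(f"2PL model:      {rmsea_2pl:.3f}")
print(f"Bifactor model: {rmsea_bifactor:.3f}")
```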

PPO scale of the SPSI-R

The female Spanish normative data (Maydeu-Olivares, Rodríguez-Fornells, Gómez-Benito & D’Zurilla, 2000) are analyzed. The sample size is 692. The scale consists of five five-category items. A typical item is ‘I try to see my problems as challenges’. The response categories are ‘Not at all true of me’, ‘Slightly true of me’, ‘Moderately true of me’, ‘Very true of me’, and ‘Extremely true of me’, coded as 0 to 4 respectively. Thus, there are $5^5$ = 3,125 response patterns. These data are also sparse, and Pearson’s $X^2$ and the likelihood ratio $G^2$ statistic yield conflicting conclusions, both on 3,099 degrees of freedom: $X^2$ = 66,054.18, p = 0 and $G^2$ = 1,098.26, p = 1. Neither p-value is accurate. $M_2$ yields 463.11 on 155 degrees of freedom, and the 90% confidence interval for RMSEA$_2$ is (0.05; 0.06). Since p(RMSEA$_2$ < 0.05) = 0.14, we conclude that the hypothesized model provides a close fit to the data.

To investigate how the fit could be improved even further, we compute all the $z_i$ statistics (9) and $z_{ij}$ statistics (10). They are displayed in Table 7. The statistics that are significant at the 5% level using the Benjamini-Hochberg method are boldfaced. There is only one statistically significant univariate statistic in this table, that of item 1. There are five statistically significant bivariate statistics, three of which involve item 1, and this item shows the largest average z statistic. Item 1, ‘When my first attempt to solve a problem fails, I believe if I don't give up, I will eventually succeed’, is considerably longer than the remaining items in this scale and it includes a conditional clause, which we believe explains why this item fits slightly worse than the rest. However, in Table 7 we see that all bivariate statistics and all but one of the univariate statistics are negative, reflecting that the observed means and cross products are smaller than expected. This suggests to us that there is a general misspecification in the model.
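The degrees of freedom quoted for both numerical examples follow from simple counts. The sketch below assumes that each K-category item contributes K parameters (one slope and K − 1 intercepts), that the full-table statistics use the number of response patterns minus the number of parameters minus one, and that $M_2$ uses the number of univariate and bivariate moments minus the number of parameters. The function names are illustrative.

```python
def full_table_df(n_items, n_categories):
    """Degrees of freedom of X2 and G2: patterns minus parameters minus one."""
    n_patterns = n_categories ** n_items
    n_parameters = n_items * n_categories
    return n_patterns - n_parameters - 1


def m2_df(n_items, n_categories):
    """Degrees of freedom of M2: univariate and bivariate moments minus parameters."""
    univariate = n_items * (n_categories - 1)
    n_pairs = n_items * (n_items - 1) // 2
    bivariate = n_pairs * (n_categories - 1) ** 2
    n_parameters = n_items * n_categories
    return univariate + bivariate - n_parameters


# E scale: 12 binary items. PPO scale: five five-category items.
for label, n, k in [("E scale", 12, 2), ("PPO scale", 5, 5)]:
    print(label, "full table df:", full_table_df(n, k), "M2 df:", m2_df(n, k))
```

The same count gives 27 degrees of freedom for nine binary items, the value reported for the K = 2 simulation condition.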
To shed more light on the nature of the misspecification, we examine the z statistics for every category in the univariate tables, as well as the z statistics for every pair of categories in the bivariate tables. These are shown in Tables 8 and 9, where we have boldfaced all the statistics that are significant at the 5% level using the Benjamini-Hochberg method. Table 8 shows that the model tends to overpredict the number of respondents that endorse the highest category. In Table 9 we see that for items 1 and 3 the model underpredicts the association between the lowest category of the item and the highest category of the remaining items. This indicates that the logistic function used in the fitted IRT model is slightly misspecified. A model with higher expected frequencies in the tails is needed. The IRT models with copulas introduced by Nikoloulopoulos and Joe (in press) can lead to higher or lower probabilities in the tails than the logistic function, depending on the choice of copula. For higher expected frequencies in the tails than the logistic graded model, a model with a bivariate t copula with 2 degrees of freedom can be employed. Fitting this model with a single latent trait yields an overall $M_2$ statistic of 195.3 on 155 df, p = 0.016, and a 90% CI for RMSEA$_2$ of (0.01; 0.03). The z statistics have enabled us to find an alternative model that provides an excellent fit to these data according to the criteria of Maydeu-Olivares and Joe (2014), RMSEA$_2$ < 0.05/(K – 1) = 0.0125.

In this polytomous example, $M_{ij}$ statistics could have been used to locate the misfitting items as an alternative to the $z_{ij}$ statistics displayed in Table 7. $M_{ij}$ statistics are displayed in Table 10 in matrix form. Using a chi-square reference distribution with $5^2 - 2 \times 5 - 1 = 14$ degrees of freedom, all bivariate $M_{ij}$ statistics are statistically significant at the 5% level using the Benjamini-Hochberg method. This supports our previous conclusion that although the chosen model provides an overall close fit to the data, there is misspecification everywhere.
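The degrees of freedom available to M statistics in a marginal table can be tabulated in the same way. This sketch assumes the logistic models considered in this paper, in which each K-category item contributes K parameters, so that a marginal table for a set of items has K raised to the number of items cells; the function name is illustrative. A non-positive result means the statistic cannot be used for that table.

```python
def marginal_df(n_categories, n_items_in_table):
    """Cells in the marginal table minus parameters involved minus one."""
    n_cells = n_categories ** n_items_in_table
    n_parameters = n_items_in_table * n_categories
    return n_cells - n_parameters - 1


print("single item, K = 4: ", marginal_df(4, 1))   # no degrees of freedom
print("binary pair:        ", marginal_df(2, 2))   # no degrees of freedom
print("binary triplet:     ", marginal_df(2, 3))
print("pair, K = 4:        ", marginal_df(4, 2))
print("pair, K = 5:        ", marginal_df(5, 2))
```

The last three counts agree with the reference distributions quoted in the text: one degree of freedom for triplets of binary items under the 2PL, seven for pairs of four-category items, and 14 for pairs of five-category items.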
In Table 10 we have also included the average of all $M_{ij}$ statistics involving each item, which suggests that the worst fitting item is item 1, in agreement with the piecewise assessment performed using the z statistics. However, note that the largest $M_{ij}$ statistic corresponds to the pair (2,1), whereas the largest $z_{ij}$ statistic involves the pair (3,4). Also, more $M_{ij}$ than $z_{ij}$ statistics are statistically significant in this example, which suggests that the former have higher power to detect the misspecification present in these data. For comparison, $X^2_{ij}$ statistics are also displayed in Table 10. In this example, the $X^2_{ij}$ statistics lead to the same conclusion as the $M_{ij}$ statistics. Put simply, the $X^2_{ij}$ statistics are larger than the $M_{ij}$ statistics, leading to an impression of poorer fit. However, it is not hard to find applications in which $M_{ij}$ and $X^2_{ij}$ lead to different substantive conclusions.

Insert Tables 7 to 10 about here

Concluding remarks

When fitting models to multivariate discrete data, statistics for items and item pairs are more informative than statistics for response patterns in suggesting ways to modify the model and obtain a better fit. We have described two methods that can enable researchers to assess these fits: z statistics and M statistics. Both statistics have excellent empirical Type I errors in small samples. M statistics applied to univariate and bivariate margins are a generalization of Pearson’s $X^2$. Unlike $X^2$, M statistics are asymptotically chi-square distributed when applied to univariate and bivariate margins. Lack of degrees of freedom is a problem with these statistics. Thus, M statistics can hardly ever be used to assess model fit to single items. When the data are binary, often they cannot be applied to assess the fit to pairs of items either; only the fit of the model to triplets of variables can be assessed, which hinders the interpretation of the results obtained.

In contrast, z statistics follow an asymptotic standard normal distribution and as a result they overcome this limitation of M statistics. However, unlike M statistics, z statistics require computation of the covariance matrix of the full set of item parameter estimates. Hence, the behavior of z statistics will depend on how accurately this matrix is estimated. In this paper, we have used the expected information matrix to estimate this matrix (see the Appendix for details). The observed information matrix or the cross-product approximation needs to be used in tests involving a large number of possible response patterns. Liu and Maydeu-Olivares (2014) found the performance of z statistics to be similar to the one reported here when the observed information matrix is used, but very poor when the cross-product approximation is used. However, future research should investigate the Type I errors of these statistics for other models, such as multidimensional IRT models.

The choice between statistics with accurate empirical Type I errors should be based on power. The z statistics are more powerful than M statistics against the alternatives reported in this paper and others we have investigated (presence of guessing, guessing and mixtures, guessing and multidimensionality). For these alternatives, Liu and Maydeu-Olivares (2014) found that z statistics generally are more powerful than the alternative residual-based statistics described in the Appendix. However, there are many more alternatives of interest we have not considered, such as the presence of doublets. For this misspecification, score tests (Glas & Suárez-Falcón, 2003; Liu & Thissen, 2012, in press) are an attractive alternative as they are designed to capture this specific type of misfit. Further research is needed to investigate the performance of test statistics for items and item pairs against a larger variety of alternatives. In particular, future research should compare the performance of z statistics and score tests in identifying doublets. However, no simulation study can mimic the variety of possible departures from the fitted model encountered in applications. Furthermore, the source of misfit in applications is unknown.
For this reason, our final recommendation is that in applications more than one statistic should be used, as different statistics have different power against different alternatives and may suggest different re-specifications of the model, one of which, hopefully, will lead to a better fitting model. Statistics with unknown sampling distributions (or with known sampling distributions and poor empirical Type I errors) should be avoided, however, as they are likely to suggest model modifications that will not result in improvements, or (worst case scenario) they may lead to discarding perfectly good items.


References

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.
Bartholomew, D. J., & Leung, S. O. (2002). A goodness of fit test for sparse 2^p contingency tables. British Journal of Mathematical and Statistical Psychology, 55, 1–15.
Bartholomew, D. J., & Tzamourani, P. (1999). The goodness of fit of latent trait models in attitude measurement. Sociological Methods and Research, 27, 525–546.
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B, 57, 289–300.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture item response model for multiple-choice data. Journal of Educational and Behavioral Statistics, 26(4), 381–409.
Cagnone, S., & Mignani, S. (2007). Assessing the goodness of fit of a latent variable model for ordinal data. Metron. International Journal of Statistics, 65, 337–361.
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^n tables. British Journal of Mathematical and Statistical Psychology, 59, 173–194.
Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO. Chicago, IL: Scientific Software International.
Chen, W.-H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265–289.
D’Zurilla, T. J., Nezu, A. M., & Maydeu-Olivares, A. (2002). Manual of the Social Problem-Solving Inventory-Revised. North Tonawanda, NY: Multi-Health Systems, Inc.
Eysenck, S. B. G., Eysenck, H. J., & Barrett, P. T. (1985). A revised version of the Psychoticism scale. Personality and Individual Differences, 6, 21–29.
Forero, C. G., & Maydeu-Olivares, A. (2009). Estimation of IRT graded models for rating data: Limited vs. full information methods. Psychological Methods, 14, 275–299.
Glas, C. A. W., & Verhelst, N. D. (1995). Testing the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models. Their foundations, recent developments and applications (pp. 69–96). New York: Springer.
Glas, C. A. W., & Suárez-Falcón, J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27, 87–106.
Joe, H., & Maydeu-Olivares, A. (2010). A general family of limited information goodness-of-fit statistics for multinomial data. Psychometrika, 75, 393–419.
Koehler, K., & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials. Journal of the American Statistical Association, 75, 336–344.
Liu, Y., & Maydeu-Olivares, A. (2013). Local dependence diagnostics in IRT modeling of binary data. Educational and Psychological Measurement, 73(2), 254–274.
Liu, Y., & Maydeu-Olivares, A. (2014). Identifying the source of misfit in item response theory models. Multivariate Behavioral Research, 49, 354–371.
Liu, Y., & Thissen, D. (2012). Identifying local dependence with a score test statistic based on the bifactor logistic model. Applied Psychological Measurement, 36, 670–688.
Liu, Y., & Thissen, D. (in press). Comparing score tests and other local dependence diagnostics for the graded response model. British Journal of Mathematical and Statistical Psychology.
Maydeu-Olivares, A., Hernández, A., & McDonald, R. P. (2006). A multidimensional ideal point item response theory model for binary data. Multivariate Behavioral Research, 41(4), 445–472. doi:10.1207/s15327906mbr4104_2
Maydeu-Olivares, A., & Joe, H. (2005). Limited and full information estimation and goodness-of-fit testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association, 100, 1009–1020.
Maydeu-Olivares, A., & Joe, H. (2006). Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika, 71, 713–732.
Maydeu-Olivares, A., & Joe, H. (2014). Assessing approximate fit in categorical data analysis. Multivariate Behavioral Research, 49, 305–328.
Maydeu-Olivares, A., Rodríguez-Fornells, A., Gómez-Benito, J., & D'Zurilla, T. J. (2000). Psychometric properties of the Spanish adaptation of the Social Problem-Solving Inventory-Revised (SPSI-R). Personality and Individual Differences, 29, 699–708.
Maydeu-Olivares, A. (2015). Evaluating fit in IRT models. In S. P. Reise & D. A. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 111–127). New York: Routledge.
McKinley, R., & Mills, C. (1985). A comparison of several goodness-of-fit statistics. Applied Psychological Measurement, 9, 49–57.
Muthén, B. (1993). Goodness of fit with categorical and other non normal variables. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 205–234). Newbury Park, CA: Sage.
Nikoloulopoulos, A., & Joe, H. (in press). Factor copula models for item response data. Psychometrika.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X^2: An item-fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289–298.
R Development Core Team. (2012). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14(2), 127–137.
Reiser, M. (1996). Analysis of residuals for the multinomial item response model. Psychometrika, 61, 509–528.
Reiser, M. (2008). Goodness-of-fit testing using components based on marginal frequencies of multinomial data. British Journal of Mathematical and Statistical Psychology, 61, 331–360.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph No. 17.
Satorra, A., & Bentler, P. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. Von Eye & C. C. Clogg (Eds.), Latent variable analysis. Applications for developmental research (pp. 399–419). Thousand Oaks, CA: Sage.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393–408.
Tay, L., & Drasgow, F. (2011). Adjusting the adjusted X^2/df ratio statistic for dichotomous item response theory analyses: Does the model fit? Educational and Psychological Measurement, 72, 510–528.
Thissen, D., Steinberg, L., & Kuang, D. (2002). Quick and easy implementation of the Benjamini-Hochberg procedure for controlling the false positive rate in multiple comparisons. Journal of Educational and Behavioral Statistics, 27, 77–83.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51(4), 567–577.
Toribio, S. G., & Albert, J. H. (2011). Discrepancy measures for item fit analysis in item response theory. Journal of Statistical Computation and Simulation, 81, 1345–1360.
Woods, C. M. (2006). Ramsay-curve item response theory to detect and correct for nonnormal latent variables. Psychological Methods, 11, 253–270.


37 Appendix

Consider modeling $N$ observations on $n$ discrete random variables $Y_i$, $i = 1, \ldots, n$. The observed responses can be gathered in an $n$-dimensional contingency table with $C = K^n$ cells. We use $\pi$ and $p$, respectively, to denote the $C$-dimensional vectors of probabilities and proportions. We also write $\pi(\theta)$ to indicate that the $C$-dimensional probability vector has some parametric form that depends on some $q$-dimensional parameter vector $\theta$ to be estimated from the data. The null and alternative hypotheses are $H_0: \pi = \pi(\theta)$ vs. $H_1: \pi \neq \pi(\theta)$.

Matrix expressions for $X_{ij}^2$ and $M_{ij}$ and alternative quadratic form statistics

We write $X_{ij}^2$ given in (5) using matrix notation as

$$X_{ij}^2 = N (p_{ij} - \hat{\pi}_{ij})' \hat{D}_{ij}^{-1} (p_{ij} - \hat{\pi}_{ij}). \tag{16}$$

When both items have $K$ response categories, $p_{ij}$ and $\hat{\pi}_{ij}$ are the $C_{ij} = K^2$ dimensional vectors of proportions and estimated probabilities for items $i$ and $j$, and $\hat{D}_{ij} = \mathrm{diag}(\hat{\pi}_{ij})$. The bivariate probabilities only involve a $q_{ij}$-dimensional subset of all parameters, denoted as $\theta_{ij}$, and we write $\hat{\pi}_{ij} = \pi_{ij}(\hat{\theta}_{ij})$ where $\hat{\theta}_{ij}$ is the ML estimate. For the one-dimensional IRT graded model considered here, $\theta_{ij}$ amounts to a set of two slopes and $2 \times (K - 1)$ intercepts. Similarly, we write $M_{ij}$ as

$$M_{ij} = X_{ij}^2 - N (p_{ij} - \hat{\pi}_{ij})' \hat{U}_{ij} (p_{ij} - \hat{\pi}_{ij}), \qquad U_{ij} = D_{ij}^{-1} \Delta_{ij} (\Delta_{ij}' D_{ij}^{-1} \Delta_{ij})^{-1} \Delta_{ij}' D_{ij}^{-1}, \tag{17}$$

where $\Delta_{ij} = \partial \pi_{ij}(\theta_{ij}) / \partial \theta_{ij}'$ denotes the $C_{ij} \times q_{ij}$ matrix of derivatives of the bivariate probabilities with respect to the parameters involved in the bivariate subtable. Provided $\Delta_{ij}$ is of full rank (i.e., $\theta_{ij}$ is estimable only from $p_{ij}$), $M_{ij}$ is asymptotically distributed as a chi-square distribution with degrees of freedom $C_{ij} - q_{ij} - 1$ for any consistent estimator (Maydeu-Olivares & Joe, 2006).
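As an illustration of (16) and (17), the following sketch computes $X_{ij}^2$ and $M_{ij}$ for the three item pairs of the exchangeable-items example, using the observed frequencies of Table 1. It assumes the exchangeable model (a single parameter $\pi$, so $q_{ij} = 1$ and df $= 4 - 1 - 1 = 2$); the helper names `pair_proportions` and `x2_and_m` are hypothetical and not part of any package. Under these assumptions the output should agree with the $X_{ij}^2$ and $M_{ij}$ columns of Table 2 up to rounding.

```python
import math
import numpy as np

# Observed frequencies from Table 1, keyed by response pattern (items 1, 2, 3).
observed = {"000": 206, "100": 159, "010": 129, "001": 134,
            "110": 116, "101": 106, "011": 76, "111": 74}
N = sum(observed.values())

# ML estimate of the common endorsement probability (mean of the item proportions).
pi_hat = sum(f * pat.count("1") for pat, f in observed.items()) / (3 * N)

# Bivariate cells in the order (0,0), (1,0), (0,1), (1,1).
CELLS = [("0", "0"), ("1", "0"), ("0", "1"), ("1", "1")]
pi_ij = np.array([(1 - pi_hat) ** 2, pi_hat * (1 - pi_hat),
                  pi_hat * (1 - pi_hat), pi_hat ** 2])
# Derivatives of the four cell probabilities with respect to pi (a 4 x 1 matrix).
delta_ij = np.array([[-2 * (1 - pi_hat)], [1 - 2 * pi_hat],
                     [1 - 2 * pi_hat], [2 * pi_hat]])


def pair_proportions(i, j):
    """Observed bivariate proportions for items i and j (0-based indices)."""
    counts = [sum(f for pat, f in observed.items() if (pat[i], pat[j]) == cell)
              for cell in CELLS]
    return np.array(counts) / N


def x2_and_m(p_ij):
    """Return (X2_ij, M_ij) following equations (16) and (17)."""
    e = p_ij - pi_ij
    d_inv = np.diag(1.0 / pi_ij)
    x2 = N * e @ d_inv @ e
    u = (d_inv @ delta_ij
         @ np.linalg.inv(delta_ij.T @ d_inv @ delta_ij)
         @ delta_ij.T @ d_inv)
    return x2, x2 - N * e @ u @ e


results = {(i + 1, j + 1): x2_and_m(pair_proportions(i, j))
           for i, j in [(0, 1), (0, 2), (1, 2)]}
for pair, (x2, m) in results.items():
    # With df = 2 the chi-square survival function is exp(-x / 2).
    print(pair, round(x2, 3), round(m, 3), round(math.exp(-m / 2), 3))
```

Because $\hat{\pi}$ is estimated from all three items rather than from the pair alone, the correction term in (17) is nonzero and $M_{ij}$ is smaller than $X_{ij}^2$.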



Cagnone and Mignani (2007) proposed a similar statistic for assessing the fit of a model to a bivariate subtable:

$$GF_{fit}^{(ij)} = N (p_{ij} - \hat{\pi}_{ij})' \hat{D}_{ij}^{-1/2} \left( \hat{D}_{ij}^{-1/2} \hat{\Sigma}_{ij} \hat{D}_{ij}^{-1/2} \right)^{+} \hat{D}_{ij}^{-1/2} (p_{ij} - \hat{\pi}_{ij}). \tag{18}$$

In (18), $\Sigma_{ij}/N$ is the asymptotic covariance matrix of the residuals for that pair of items and $+$ denotes a Moore-Penrose generalized inverse. The asymptotic distribution of $GF_{fit}^{(ij)}$ is chi-square with degrees of freedom equal to the rank of $D_{ij}^{-1/2} ( D_{ij}^{-1/2} \Sigma_{ij} D_{ij}^{-1/2} )^{+} D_{ij}^{-1/2}$.

Liu and Maydeu-Olivares (2014) evaluated the performance of the simpler statistic

$$R_{ij} = N (p_{ij} - \hat{\pi}_{ij})' \hat{\Sigma}_{ij}^{+} (p_{ij} - \hat{\pi}_{ij}), \tag{19}$$

whose asymptotic distribution is chi-square with degrees of freedom equal to the rank of $\Sigma_{ij}$.

Computing $\Sigma_{ij}$

An estimate of the $C_{ij} \times C_{ij}$ matrix $\Sigma_{ij}$ is not needed to compute $M_{ij}$ but it is needed to compute the z statistics as well as $GF_{fit}^{(ij)}$ and $R_{ij}$. For the ML estimator, the asymptotic distribution of the cell residuals $\sqrt{N}(p - \hat{\pi})$ is normal with mean zero and covariance matrix $\Sigma = \mathrm{diag}(\pi) - \pi\pi' - \Delta \mathcal{I}^{-1} \Delta'$, where $\mathcal{I}^{-1}$ denotes the asymptotic covariance matrix of all the model parameter estimates, $\sqrt{N}(\hat{\theta} - \theta)$. Consider now a bivariate subtable. The asymptotic distribution of $\sqrt{N}(p_{ij} - \hat{\pi}_{ij})$ is normal with mean zero and covariance matrix

$$\Sigma_{ij} = \mathrm{diag}(\pi_{ij}) - \pi_{ij}\pi_{ij}' - \Delta_{ij} [\mathcal{I}^{-1}]_{ij} \Delta_{ij}' \tag{20}$$

since $(p_{ij} - \hat{\pi}_{ij}) = T_{ij}(p - \hat{\pi})$, where $T_{ij}$ is a $C_{ij} \times C$ matrix of 1s and 0s that maps the cell residuals into the marginal bivariate table. In (20), $[\mathcal{I}^{-1}]_{ij}$ denotes the submatrix of $\mathcal{I}^{-1}$ corresponding to the parameter vector $\theta_{ij}$. In practice, $p_{ij}$, $\hat{\pi}_{ij}$ and $\Sigma_{ij}$ in (20) are computed directly, without the use of the transformation matrix $T_{ij}$. The $T_{ij}$ matrix is used here simply to show the relation to the cell residuals $p - \hat{\pi}$. In all the above expressions $\pi_{ij}$ and $\Delta_{ij}$ are evaluated at the parameter estimates, and the information matrix, $\mathcal{I}$, can be approximated in different ways. For multinomial models estimated using ML, one can approximate this matrix using the expected information matrix

$$\mathcal{I}_E = \Delta' D^{-1} \Delta. \tag{21}$$

However, the expected information matrix can only be computed for models with a limited number of possible response patterns $C$, as $\pi$ is of dimension $C$ and $\Delta$ is of dimension $C \times q$. For larger models, one can use the observed information matrix or the cross-products approximation to the information matrix since in both cases only observed patterns are involved. The observed information matrix involves second order derivatives, whereas the cross-products approximation is

$$\mathcal{I}_{XP} = \Delta_O' \, \mathrm{diag}(p_O / \pi_O^2) \, \Delta_O. \tag{22}$$

In (22), $p_O$ and $\pi_O$ denote the proportions and probabilities of the $C_O$ observed patterns and $\Delta_O$ is the $C_O \times q$ matrix of derivatives of the observed patterns with respect to the model parameters (e.g., Bock & Lieberman, 1970). In this paper we use the expected information (21).

Analogously, for a univariate subtable the asymptotic distribution of $\sqrt{N}(p_i - \hat{\pi}_i)$, where $\hat{\pi}_i := \pi_i(\theta_i)$ evaluated at the MLE, is normal with mean zero and covariance matrix

$$\Sigma_i = \mathrm{diag}(\pi_i) - \pi_i\pi_i' - \Delta_i [\mathcal{I}^{-1}]_i \Delta_i' \tag{23}$$

since $(p_i - \hat{\pi}_i) = T_i(p - \hat{\pi})$, where now $T_i$ is a $C_i \times C$ matrix that maps the cell residuals into the marginal univariate table. In (23), $[\mathcal{I}^{-1}]_i$ denotes the submatrix of $\mathcal{I}^{-1}$ corresponding to the parameter vector $\theta_i$ and $\Delta_i = \partial \pi_i(\theta_i) / \partial \theta_i'$ denotes the $C_i \times q_i$ matrix of derivatives of the univariate probabilities with respect to the parameters involved in the univariate subtable.

$z_i$ and $z_{ij}$ statistics for polytomous data

We now provide the formulas needed to compute the SEs involved in the univariate and bivariate z statistics for polytomous ordinal data (9) and (10). The residual mean $(k_i - \hat{\kappa}_i)$ is obtained by "concentrating" the information contained in the univariate residual proportions using $(k_i - \hat{\kappa}_i) = v_i'(p_i - \hat{\pi}_i)$, with $v_i' = (0, 1, 2, \ldots, K_i - 1)$. Therefore, the asymptotic standard error of the residual mean is $\sqrt{\sigma_i^2 / N}$, where

$$\sigma_i^2 = v_i' \Sigma_i v_i. \tag{24}$$

Similarly, the residual cross-product is obtained from the vector of bivariate residual proportions using $(k_{ij} - \hat{\kappa}_{ij}) = v_{ij}'(p_{ij} - \hat{\pi}_{ij})$, with $v_{ij}' = (0 \times 0, 0 \times 1, \ldots, 0 \times (K_j - 1), 1 \times 0, 1 \times 1, \ldots, 1 \times (K_j - 1), \ldots, (K_i - 1) \times 0, (K_i - 1) \times 1, \ldots, (K_i - 1) \times (K_j - 1))$. Therefore, the asymptotic standard error of the residual cross-product is $\sqrt{\sigma_{ij}^2 / N}$, where

$$\sigma_{ij}^2 = v_{ij}' \Sigma_{ij} v_{ij}. \tag{25}$$
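The chain from the full-table covariance matrix to a bivariate z statistic can be sketched numerically. The example below assumes the exchangeable-items model and the observed frequencies of Table 1, uses the expected information (21), and builds the mapping matrix $T_{ij}$ explicitly only to mirror the exposition (in practice it is not needed). The function name `marginal_map` is hypothetical. Under these assumptions the cell z statistics should match the last column of Table 1 and the $z_{ij}$ values should match Table 2, up to rounding.

```python
import numpy as np

# Table 1: response patterns (items 1, 2, 3) and observed frequencies.
patterns = ["000", "100", "010", "001", "110", "101", "011", "111"]
freq = np.array([206, 159, 129, 134, 116, 106, 76, 74], dtype=float)
N = freq.sum()
p = freq / N

Y = np.array([[int(ch) for ch in pat] for pat in patterns])  # 8 x 3 item scores
n_items = Y.shape[1]
k = Y.sum(axis=1)                        # number of endorsed items per pattern
pi_hat = (p @ Y).mean()                  # MLE of the common probability

# Cell probabilities and their derivatives with respect to pi (C x q = 8 x 1).
pi = pi_hat ** k * (1 - pi_hat) ** (n_items - k)
delta = (pi * (k / pi_hat - (n_items - k) / (1 - pi_hat))).reshape(-1, 1)

# Expected information (21) and covariance matrix of sqrt(N) * cell residuals.
info_E = delta.T @ np.diag(1.0 / pi) @ delta
acov_theta = np.linalg.inv(info_E)
Sigma = np.diag(pi) - np.outer(pi, pi) - delta @ acov_theta @ delta.T

# z statistics for the cell residuals.
z_cells = (p - pi) / np.sqrt(np.diag(Sigma) / N)


def marginal_map(i, j):
    """T_ij: maps the 8 cells into the bivariate cells (0,0), (0,1), (1,0), (1,1)."""
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    return np.array([[1.0 if (Y[c, i], Y[c, j]) == cell else 0.0
                      for c in range(len(patterns))] for cell in cells])


# Scores v_ij = (0*0, 0*1, 1*0, 1*1) concentrate the residuals into a cross-product.
v_ij = np.array([0.0, 0.0, 0.0, 1.0])
z_pairs = {}
for i, j in [(0, 1), (0, 2), (1, 2)]:
    T = marginal_map(i, j)
    e_ij = T @ (p - pi)                  # bivariate residuals
    Sigma_ij = T @ Sigma @ T.T           # equation (20)
    sigma2_ij = v_ij @ Sigma_ij @ v_ij   # equation (25)
    z_pairs[(i + 1, j + 1)] = (v_ij @ e_ij) / np.sqrt(sigma2_ij / N)

print(np.round(z_cells, 2))
print({pair: round(z, 3) for pair, z in z_pairs.items()})
```

For polytomous items the same steps apply with $v_{ij}$ holding all products of category scores; only the construction of $\pi$, $\Delta$ and the mapping matrix changes.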

$z_i$ and $z_{ij}$ statistics for binary data

For binary variables, equations (24) and (25) simplify. The population mean is $\kappa_i = \Pr(Y_i = 1) = \pi_i$ and (24) can be written as

$$\sigma_i^2 = \pi_i (1 - \pi_i) - \Delta_i [\mathcal{I}^{-1}]_i \Delta_i', \tag{26}$$

where $\Delta_i = \partial \pi_i(\theta_i) / \partial \theta_i'$ is a $1 \times q_i$ vector of derivatives. Also, the population cross-product is $\kappa_{ij} = \Pr(Y_i = 1, Y_j = 1) = \pi_{ij}$ and (25) can be written as

$$\sigma_{ij}^2 = \pi_{ij} (1 - \pi_{ij}) - \Delta_{ij} [\mathcal{I}^{-1}]_{ij} \Delta_{ij}', \tag{27}$$

where $\Delta_{ij} = \partial \pi_{ij}(\theta_{ij}) / \partial \theta_{ij}'$ is a $1 \times q_{ij}$ vector of derivatives. Equations (26) and (27) can be used to obtain the SEs involved in equations (6) and (7) for the MLE.

z statistics for specific categories of univariate and bivariate tables

We give results for a univariate table; results for a bivariate table are very similar. The z statistics for the set of dimension $K$ of univariate residuals can be written in matrix form as

$$z_i = \frac{p_i - \hat{\pi}_i}{\sqrt{\mathrm{vecdiag}(\hat{\Sigma}_i) / N}}, \tag{28}$$

where the division is taken element by element. The scalar form of (28) is given in (12),

$$z_{il} = \frac{p_{il} - \hat{\pi}_{il}}{SE(p_{il} - \hat{\pi}_{il})},$$

where, asymptotically, $SE(p_{il} - \hat{\pi}_{il}) = \sqrt{\sigma_{il}^2 / N}$ with

$$\sigma_{il}^2 = \pi_{il} (1 - \pi_{il}) - \Delta_{il} [\mathcal{I}^{-1}]_{il} \Delta_{il}'. \tag{29}$$

Test statistics for assessing the overall fit of the model

We now provide formulas for the overall goodness-of-fit statistics described in the article. Maydeu-Olivares and Joe's (2005, 2006) $M_2$ statistic can be written as

$$M_2 = N (p_2 - \hat{\pi}_2)' \hat{C}_2 (p_2 - \hat{\pi}_2), \qquad C_2 = \Xi_2^{-1} - \Xi_2^{-1} \Delta_2 (\Delta_2' \Xi_2^{-1} \Delta_2)^{-1} \Delta_2' \Xi_2^{-1}. \tag{30}$$

In (30), $p_2$ is the set of univariate and bivariate proportions that do not include category zero and $\hat{\pi}_2$ are the corresponding expected probabilities. There are $n(K - 1)$ univariate proportions that do not include category zero and $(n(n - 1)/2)(K - 1)^2$ bivariate proportions. Thus, the total number of elements in $p_2$ is $s = n(K - 1) + (n(n - 1)/2)(K - 1)^2$. Also, in (30) $\Xi_2 / N$ is the asymptotic covariance matrix of $p_2$ and $\Delta_2 = \partial \pi_2(\theta) / \partial \theta'$ is the matrix of derivatives of the probabilities involved in (30) with respect to the model parameters. Provided $\Delta_2$ is of full rank (i.e., the model is estimable from univariate and bivariate information alone), the asymptotic distribution of $M_2$ is chi-square with $s - q$ degrees of freedom. In $M_2$ the proportions and probabilities that involve category zero are excluded for computational convenience, as the probabilities within each univariate and bivariate table must add up to one and therefore there is a redundancy in each table.

For binary data, Reiser (1996) introduced a statistic that differs from $M_2$ in terms of the weight matrix used:

$$R_2 = N (p_2 - \hat{\pi}_2)' \hat{\Sigma}_2^{+} (p_2 - \hat{\pi}_2), \qquad \Sigma_2 = \Xi_2 - \Delta_2 \mathcal{I}^{-1} \Delta_2'. \tag{31}$$

Reiser's statistic follows an asymptotic chi-square distribution with degrees of freedom equal to the rank of $\Sigma_2$ (see also Reiser, 2008). A statistic similar to $R_2$ but suitable for polytomous data has been proposed by Cagnone and Mignani (2007).

$M_{ij} = 0$ when degrees of freedom are zero

In this subsection, we show that $M_{ij} = 0$ when df = 0. An analogous procedure can be used to show that any member of the M family of statistics for multivariate discrete data (Maydeu-Olivares & Joe, 2005, 2006) takes the value of zero when df = 0. Let $\dot{e}_{ij} = \dot{p}_{ij} - \hat{\dot{\pi}}_{ij}$ be the residuals $p_{ij} - \hat{\pi}_{ij}$ excluding the response pattern where all items take the value zero. Thus, $\dot{e}_{ij}$ is of dimension $K^2 - 1$. Maydeu-Olivares and Joe (2006) showed that since the sum of all the $K^2$ bivariate cell probabilities $\pi_{ij}$ must equal one, $X_{ij}^2$ can be algebraically rewritten as

$$X_{ij}^2 = N \dot{e}_{ij}' \hat{\Xi}_{ij}^{-1} \dot{e}_{ij} \tag{32}$$

and that $M_{ij}$ can be algebraically rewritten as

$$M_{ij} = N \dot{e}_{ij}' \hat{C}_{ij} \dot{e}_{ij}, \qquad C_{ij} = \Xi_{ij}^{-1} - \Xi_{ij}^{-1} \dot{\Delta}_{ij} (\dot{\Delta}_{ij}' \Xi_{ij}^{-1} \dot{\Delta}_{ij})^{-1} \dot{\Delta}_{ij}' \Xi_{ij}^{-1}, \tag{33}$$

where $\Xi_{ij} = N \, \mathrm{acov}(\dot{p}_{ij} - \dot{\pi}_{ij})$, and $\dot{\Delta}_{ij} = \partial \dot{\pi}_{ij}(\theta_{ij}) / \partial \theta_{ij}'$ is a $(K^2 - 1) \times q_{ij}$ matrix of rank $q_{ij}$ by assumption. Now, when df = 0, $q_{ij} = K^2 - 1$ and $\dot{\Delta}_{ij}$ has an inverse. Therefore, $(\dot{\Delta}_{ij}' \Xi_{ij}^{-1} \dot{\Delta}_{ij})^{-1} = \dot{\Delta}_{ij}^{-1} \Xi_{ij} (\dot{\Delta}_{ij}')^{-1}$, $C_{ij} = \Xi_{ij}^{-1} - \Xi_{ij}^{-1} = 0$, and $M_{ij} = X_{ij}^2 - X_{ij}^2 = 0$. In contrast, when df = 0 for testing with $M_{ij}$, $X_{ij}^2 \geq 0$ with equality holding if $\dot{e}_{ij} = 0$.

Some results for the exchangeable items model

To shed some light on the formulas provided in this Appendix, we show below how they apply when testing the fit of the exchangeable items model using only univariate information. Consider an exchangeable model for $n$ items. We note that under the model, $\pi_i = \pi_i(\pi) = \Pr(Y_i = 1) = \pi$ for all items. Similarly, $\pi_{ij} = \Pr(Y_i = 1, Y_j = 1) = \pi^2$. The corresponding sample proportions are denoted by $p_i$ and $p_{ij}$. The MLE is $\hat{\pi}_{ML} = \frac{1}{n} \sum_{i=1}^{n} p_i$, the inverse of the expected information matrix (21) is

$$\mathcal{I}_E^{-1} = (\Delta' D^{-1} \Delta)^{-1} = n^{-1} (\pi - \pi^2) \tag{34}$$

and the asymptotic standard error of the MLE can be obtained as $SE(\hat{\pi}) = \sqrt{\mathcal{I}_E^{-1} / N} = \sqrt{(\pi - \pi^2) / (nN)}$.

Now consider testing the fit of the model to each of its items. To do so, we can use the $z_i$ statistic given in (6). For this model, when the MLE is used and its asymptotic variance is estimated using the expected information matrix, $SE(p_i - \hat{\pi}_i) = \sqrt{(n - 1)(\pi - \pi^2) / (nN)}$. This standard error is obtained as follows: let $\pi_i' = (\Pr(Y_i = 0), \Pr(Y_i = 1)) = (1 - \pi_i, \pi_i)$, where for this model $\pi_i' = (1 - \pi, \pi)$, and let $p_i' = (1 - p_i, p_i)$ be the vector of observed proportions for item $i$. We have $\Delta_i = \partial \pi_i(\pi) / \partial \pi = (-1, 1)'$. The asymptotic covariance matrix of $\sqrt{N}(p_i - \hat{\pi}_i)$, $\Sigma_i$, is given in (23). Then, $SE(p_i - \hat{\pi}_i) = \sqrt{\Sigma_i^{(2,2)} / N}$, where $\Sigma_i^{(2,2)}$ denotes the (2,2) element of $\Sigma_i$, and we have used (34).
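A short numerical check of this derivation, assuming the exchangeable model with $n = 3$ items and the item proportions implied by the Table 1 frequencies (455, 395 and 390 endorsements out of $N$ = 1,000). The sketch builds $\Sigma_i$ from (23) with $\Delta_i = (-1, 1)'$ and the variance in (34), and compares its (2,2) element with the closed form $(n - 1)(\pi - \pi^2)/n$ that follows from those two equations; variable names are illustrative only.

```python
import numpy as np

# Item endorsement proportions implied by the Table 1 frequencies.
p_items = np.array([0.455, 0.395, 0.390])
N, n = 1000, 3

pi_hat = p_items.mean()                      # MLE of the common probability
acov_pi = (pi_hat - pi_hat ** 2) / n         # inverse expected information, eq. (34)

# Univariate table for one item: categories (0, 1).
pi_i = np.array([1 - pi_hat, pi_hat])
delta_i = np.array([[-1.0], [1.0]])
Sigma_i = np.diag(pi_i) - np.outer(pi_i, pi_i) - acov_pi * (delta_i @ delta_i.T)

se_matrix = np.sqrt(Sigma_i[1, 1] / N)       # from the (2,2) element of eq. (23)
se_closed = np.sqrt((n - 1) * (pi_hat - pi_hat ** 2) / (n * N))

z_items = (p_items - pi_hat) / se_closed
print(round(se_matrix, 6), round(se_closed, 6))
print(np.round(z_items, 3))
```

Under these assumptions the three z statistics agree with the $z_i$ column of Table 2 up to rounding.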



Because this model can be estimated using only univariate information, its overall fit can also be tested using univariate information. Let

$$M_1 = N (p_1 - \hat{\pi}_1)' \hat{C}_1 (p_1 - \hat{\pi}_1), \qquad C_1 = \Xi_1^{-1} - \Xi_1^{-1} \Delta_1 (\Delta_1' \Xi_1^{-1} \Delta_1)^{-1} \Delta_1' \Xi_1^{-1}. \tag{35}$$

In (35), $\pi_1' = (\Pr(Y_1 = 1), \Pr(Y_2 = 1), \ldots, \Pr(Y_n = 1))$, $\Delta_1 = \partial \pi_1(\theta) / \partial \theta'$, and $\Xi_1 = \mathrm{acov}(\sqrt{N}(p_1 - \pi_1))$ has diagonal elements $\pi_i (1 - \pi_i)$ and off-diagonal elements $\pi_{ij} - \pi_i \pi_j$. For the exchangeable items model, $\pi_1 = \pi \mathbf{1}$, $\Delta_1 = \mathbf{1}$, $\Xi_1$ is a diagonal matrix with elements $\pi - \pi^2$, and $C_1 = \Xi_1^{-1} - (n(\pi - \pi^2))^{-1} \mathbf{1}\mathbf{1}'$, where $\mathbf{1}$ is an $n$-dimensional vector of ones. For this model, Pearson's $X^2$ applied to a univariate margin equals $X_i^2 = N (p_i - \hat{\pi})^2 / (\hat{\pi} - \hat{\pi}^2)$. Therefore, for this model, $M_1 = \sum_{i=1}^{n} X_i^2$.

For any consistent and asymptotically normal estimator, $M_1$ is asymptotically distributed as a chi-square with df = $n(K - 1) - q$. Thus, for this model $M_1 \sim \chi^2_{n-1}$.
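The identity $M_1 = \sum_i X_i^2$ can be checked numerically. The sketch below assumes the exchangeable model with the item proportions implied by Table 1 and evaluates (35) in matrix form; it is an illustration under those assumptions, not the authors' code. With $n = 3$ the reference distribution has 2 degrees of freedom, for which the chi-square upper tail probability is $\exp(-x/2)$.

```python
import math
import numpy as np

# Item endorsement proportions implied by the Table 1 frequencies.
p1 = np.array([0.455, 0.395, 0.390])
N, n = 1000, 3

pi_hat = p1.mean()
var = pi_hat - pi_hat ** 2                   # common diagonal element of Xi_1

ones = np.ones((n, 1))                       # Delta_1 for the exchangeable model
Xi_inv = np.eye(n) / var
C1 = Xi_inv - Xi_inv @ ones @ np.linalg.inv(ones.T @ Xi_inv @ ones) @ ones.T @ Xi_inv

e = p1 - pi_hat
M1 = N * e @ C1 @ e                          # equation (35)
X2_items = N * e ** 2 / var                  # Pearson X2 for each univariate margin

p_value = math.exp(-M1 / 2)                  # chi-square upper tail with df = n - 1 = 2
print(round(M1, 3), round(X2_items.sum(), 3), round(p_value, 4))
```

The two quantities coincide because the residuals sum to zero at the MLE, so the second term of $C_1$ contributes nothing.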


Footnotes

1. They are referred to as cell residuals because the observations can be placed in a contingency table with 2^3 cells.

2. We assume that all items consist of the same number of categories simply to simplify our exposition. The methods described here do not require this.

3. Its asymptotic distribution is a mixture of independent chi-square variables, each with one degree of freedom.

4. For instance, item 3 in Table 1 can take two values, 0 and 1. The observed frequency of observing 0 is obtained by summing the observed frequencies of the response patterns where item 3 takes the value 0. These are patterns 000, 100, 010, 110. Similarly, one can obtain the marginal distribution of items 1 and 2. In this case, four possible responses can be obtained (00, 10, 01, 11). The observed frequency for response 10, for instance, is obtained by summing the observed frequencies of the response patterns where item 1 equals 1 and item 2 equals 0: patterns 100 and 101.

5. Of course, $X_i^2$ is distributed as a chi-square if $\hat{\theta}_{ML} = \hat{\theta}_i$, and $X_{ij}^2$ is distributed as a chi-square if $\hat{\theta}_{ML} = \hat{\theta}_{ij}$.

6. A table involving four items with five response categories each yields 4^5 = 1,024 response patterns. Although this can be considered a large model from a statistical viewpoint, it is rather small from a psychological modeling viewpoint.

7. In IRT terminology, the usual ML estimator is often called marginal maximum likelihood (MML).

8. Estimated factor loadings for the EPQ-R application range from 0.55 to 0.82 with an average of 0.72; absolute values of the thresholds range from 0.2 to 1.2 with an average of 0.55.

9. For one item there are two parameters involved, for two items there are four parameters (two intercepts and two slopes). Therefore degrees of freedom for testing single items are 2 – 2 – 1 = -1, and for testing pairs of items 4 – 4 – 1 = -1.

10. Three intercepts and a slope parameter are used to model each item with 4 response alternatives. Therefore, there are 4^2 – 2 × 4 – 1 = 7 degrees of freedom.

11. Larger sample sizes would be needed for highly skewed items, as in this case item parameters are more poorly estimated (Forero & Maydeu-Olivares, 2009).

12. We use the notation $M_{ijk}$ and $X_{ijk}^2$ to indicate that triplets of items are used.

13. When the $z_{ij}$ statistic is negative, one slope is to be fixed at 1 and the other at -1 (Maydeu-Olivares, 2015).


Table 1
A small illustrative example with three binary variables: Response patterns, observed frequencies, and probabilities under a model that assumes that the items are exchangeable

Response   Observed      Probabilities   Expected      z statistics for
patterns   frequencies                   frequencies   cell residuals
000        206           (1 - π)^3       201.92         0.47
100        159           π(1 - π)^2      142.26         1.53
010        129           π(1 - π)^2      142.26        -1.21
001        134           π(1 - π)^2      142.26        -0.75
110        116           π^2(1 - π)      100.23         1.74
101        106           π^2(1 - π)      100.23         0.64
011         76           π^2(1 - π)      100.23        -2.67
111         74           π^3              70.62         0.51

Note: N = 1,000; π denotes the probability of endorsing an item; an exchangeable model assumes that the probability of endorsing an item is the same for all items. The model was estimated by maximum likelihood.
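The expected frequencies in Table 1 can be recomputed from the observed frequencies alone. The following is a minimal sketch, assuming (as stated in the Appendix) that the ML estimate of the common probability under the exchangeable model is the average of the item proportions; `pattern_probability` is a hypothetical helper name.

```python
import numpy as np

# Observed frequencies from Table 1, keyed by response pattern (items 1, 2, 3).
observed = {"000": 206, "100": 159, "010": 129, "001": 134,
            "110": 116, "101": 106, "011": 76, "111": 74}
N = sum(observed.values())
n_items = 3

# Marginal proportion of endorsing each item: sum the patterns where it equals 1.
item_props = np.array([
    sum(f for pat, f in observed.items() if pat[i] == "1") / N
    for i in range(n_items)
])

# ML estimate of the common endorsement probability under exchangeability.
pi_hat = item_props.mean()


def pattern_probability(pattern, pi):
    """Probability of a response pattern when all items share probability pi."""
    k = pattern.count("1")
    return pi ** k * (1 - pi) ** (len(pattern) - k)


expected = {pat: N * pattern_probability(pat, pi_hat) for pat in observed}
for pat, f in observed.items():
    print(pat, f, round(expected[pat], 2))
```

The item proportions computed on the way (0.455, 0.395, 0.390) are the univariate margins described in footnote 4.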


Table 2
Exchangeable items example

X_ij^2, M_ij, and z_ij statistics for pairs of variables

Pair   X_ij^2   p       M_ij    p       z_ij     p
1,2    10.084   0.006   8.961   0.011    2.061   0.039
1,3     9.447   0.009   8.754   0.013    0.985   0.325
2,3     3.854   0.146   0.275   0.872   -2.242   0.025

z_i statistics for single items

Item   z_i      p
1       3.277   0.001
2      -1.442   0.149
3      -1.835   0.066

Note: N = 1,000. Asymptotically, the z statistics follow a standard normal distribution, and M_ij a chi-square distribution with 2 df. P-values for X_ij^2 were computed using a chi-square distribution with 2 df and they are incorrect.


Table 3
Bivariate diagnostic statistics for correctly specified models: Empirical rejection rates of z_ij, M_ij, and X_ij^2 at α = 0.05.

               K = 2    K = 4
Pair   N       z_ij     M_ij    X_ij^2   z_ij
1,2     100    0.05     0.04    0.08     0.04
1,2    1000    0.04     0.05    0.09     0.06
1,3     100    0.05     0.06    0.09     0.06
1,3    1000    0.06     0.05    0.10     0.04
2,3     100    0.05     0.05    0.09     0.05
2,3    1000    0.05     0.05    0.10     0.04

Notes: N = sample size. The fitted model was a correctly specified graded logistic model for 2 and 4 category items. Empirical rejection rates should be as close as possible to 0.05. When K = 2, the model is a 2PL model. There are 7 df for M_ij. P-values for X_ij^2 were obtained using a chi-square distribution with 7 df.


Table 4
Diagnostic statistics for misspecified models: Empirical rejection rates at α = 0.05 and N = 1000.

Univariate statistic: z_i

        K = 2                          K = 4
Item    3-dim   Mix 1SD   Mix 3SD      2-dim   Mix 1SD   Mix 3SD
1       0.06    0.06      0.84         0.06    0.05      0.94
2       0.06    0.07      0.82         0.05    0.03      0.96
3       0.06    0.06      0.83         0.05    0.05      0.94

Bivariate statistics: z_ij and M_ij

        K = 2                          K = 4
        3-dim   Mix 1SD   Mix 3SD      2-dim           Mix 1SD         Mix 3SD
Pair    z_ij    z_ij      z_ij         M_ij    z_ij    M_ij    z_ij    M_ij    z_ij
1,2     1.00    0.04      0.82         0.19    0.90    0.04    0.04    0.56    0.99
1,3     1.00    0.07      0.84         0.09    0.82    0.05    0.05    0.49    0.98
2,3     0.99    0.06      0.70         0.07    0.63    0.04    0.04    0.45    0.93
1,4     0.66    0.04      0.84         0.04    0.97    0.05    0.04    0.59    1.00
2,5     0.67    0.07      0.74         0.04    0.90    0.05    0.05    0.55    0.98
3,6     0.42    0.05      0.39         0.07    0.66    0.06    0.04    0.36    0.85

Notes: The fitted model was a one-dimensional graded logistic model for K = 2 and 4 category items. When K = 2, the model is a 2PL model. Data were generated using a) a multidimensional graded model, b) a mixture (50/50) of two 2PL/graded models with one standard deviation latent trait mean difference, and c) a 50/50 mixture with a three standard deviation latent trait mean difference. Empirical rejection rates should be as large as possible. There are 7 df for M_ij.


Table 5
Diagnostic statistics for misspecified models: Empirical rejection rates at α = 0.05 and N = 1000 of z statistics for univariate and bivariate cell residuals in polytomous items.

Univariate cell residuals

Item   Cat   2-dim   Mix 1SD   Mix 3SD
1      0     0.05    0.06      0.42
1      1     0.05    0.05      0.97
1      2     0.05    0.06      0.09
1      3     0.06    0.05      0.97
2      0     0.05    0.05      0.41
2      1     0.06    0.04      0.97
2      2     0.05    0.06      0.08
2      3     0.06    0.04      0.97
3      0     0.05    0.06      0.40
3      1     0.05    0.05      0.97
3      2     0.05    0.05      0.08
3      3     0.06    0.05      0.98

Bivariate cell residuals

Items   Cats   2-dim   Mix 1SD   Mix 3SD
1,2     0,0    0.68    0.04      0.58
1,2     0,1    0.07    0.04      0.33
1,2     0,2    0.42    0.05      0.28
1,2     0,3    0.60    0.04      0.13
1,2     1,0    0.05    0.05      0.37
1,2     1,1    0.22    0.04      0.20
1,2     1,2    0.09    0.06      0.14
1,2     1,3    0.39    0.05      0.06
1,2     2,0    0.39    0.04      0.29
1,2     2,1    0.11    0.04      0.12
1,2     2,2    0.23    0.05      0.08
1,2     2,3    0.05    0.04      0.32
1,2     3,0    0.62    0.04      0.07
1,2     3,1    0.41    0.05      0.10
1,2     3,2    0.05    0.05      0.36
1,2     3,3    0.69    0.04      0.43
2,3     0,0    0.55    0.06      0.52
2,3     0,1    0.06    0.05      0.21
2,3     0,2    0.21    0.06      0.23
2,3     0,3    0.53    0.05      0.12
2,3     1,0    0.06    0.06      0.31
2,3     1,1    0.09    0.05      0.16
2,3     1,2    0.06    0.05      0.10
2,3     1,3    0.19    0.04      0.06
2,3     2,0    0.20    0.06      0.20
2,3     2,1    0.07    0.05      0.10
2,3     2,2    0.11    0.06      0.05
2,3     2,3    0.05    0.05      0.26
2,3     3,0    0.54    0.05      0.06
2,3     3,1    0.21    0.05      0.10
2,3     3,2    0.07    0.05      0.28
2,3     3,3    0.54    0.04      0.16

Notes: The fitted model was a one-dimensional graded logistic model for 4-category items. Data were generated using a) a multidimensional graded model, b) a mixture (50/50) of two graded models with one standard deviation latent trait mean difference, and c) a 50/50 mixture with a three standard deviation latent trait mean difference. Empirical rejection rates should be as large as possible.


Table 6
EPQ data: univariate and bivariate z statistics

Item      1       2       3       4       5       6       7       8       9      10      11      12   Average
 1    -1.08    0.34   -4.72   -2.69   -0.01   -0.63    7.17   -3.95   -1.89   -4.65   -1.25   -0.06    2.37
 2     0.34   -0.06   -0.51   -1.11   -3.45   -0.53   -2.23   -3.04   -2.66   -3.74   -0.49    6.52    2.06
 3    -4.72   -0.51   -0.23    0.64    2.19   -2.94   -2.27   -0.25    1.89    0.06    3.00   -2.45    1.76
 4    -2.69   -1.11    0.64    2.68   -0.45    2.18   -0.66   -4.19    6.52   -2.81    1.21   -3.17    2.36
 5    -0.01   -3.45    2.19   -0.45   -2.02   -1.01    3.15   -2.94   -0.94   -3.00   -0.71   -3.17    1.92
 6    -0.63   -0.53   -2.94    2.18   -1.01   -1.70   -0.26   -0.08    1.30   -0.86   -0.78   -2.05    1.19
 7     7.17   -2.23   -2.27   -0.66    3.15   -0.26   -1.03   -4.66   -1.37   -5.13   -2.86   -1.59    2.70
 8    -3.95   -3.04   -0.25   -4.19   -2.94   -0.08   -4.66   -3.62   -5.57   11.82   -3.05   -1.27    3.70
 9    -1.89   -2.66    1.89    6.52   -0.94    1.30   -1.37   -5.57    2.51   -2.79    4.46   -3.51    2.95
10    -4.65   -3.74    0.06   -2.81   -3.00   -0.86   -5.13   11.82   -2.79   -3.24   -2.72   -3.17    3.67
11    -1.25   -0.49    3.00    1.21   -0.71   -0.78   -2.86   -3.05    4.46   -2.72   -0.70    0.10    1.78
12    -0.06    6.52   -2.45   -3.17   -3.17   -2.05   -1.59   -1.27   -3.51   -3.17    0.10   -1.56    2.39

Notes: Univariate z statistics in the diagonal. We have boldfaced the z statistics significant at an overall 5% level using the Benjamini-Hochberg procedure. The average column is the row average of absolute values of z statistics.


Table 7
PPO data: univariate and bivariate z statistics

Item      1       2       3       4       5   Average
1     -3.31   -3.29   -2.10   -3.83   -2.48    3.00
2     -3.29    1.29   -1.62   -0.10   -2.00    1.66
3     -2.10   -1.62   -2.00   -4.41   -1.76    2.38
4     -3.83   -0.10   -4.41   -2.10   -2.68    2.62
5     -2.48   -2.00   -1.76   -2.68   -1.35    2.06

Notes: Univariate z statistics in the diagonal. We have boldfaced the z statistics significant at an overall 5% level using the Benjamini-Hochberg procedure. The average column is the row average of absolute values of z statistics.
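The Benjamini-Hochberg step-up procedure mentioned in the notes can be sketched as follows. This is an illustration only: it assumes two-sided p-values from the standard normal distribution and treats the 5 univariate and 10 bivariate statistics of Table 7 as a single family of 15 tests, which may differ from how the authors grouped the tests. The function names are hypothetical, and keys (i, i) denote univariate statistics.

```python
import math

# z statistics from Table 7: (i, i) are univariate, (i, j) with i < j are bivariate.
z_values = {
    (1, 1): -3.31, (2, 2): 1.29, (3, 3): -2.00, (4, 4): -2.10, (5, 5): -1.35,
    (1, 2): -3.29, (1, 3): -2.10, (1, 4): -3.83, (1, 5): -2.48,
    (2, 3): -1.62, (2, 4): -0.10, (2, 5): -2.00,
    (3, 4): -4.41, (3, 5): -1.76, (4, 5): -2.68,
}


def two_sided_p(z):
    """Two-sided p-value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the keys rejected by the Benjamini-Hochberg step-up procedure."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ordered)
    cutoff = 0
    for rank, (_, p) in enumerate(ordered, start=1):
        # Keep the largest rank whose p-value is below its threshold.
        if p <= rank * alpha / m:
            cutoff = rank
    return {key for key, _ in ordered[:cutoff]}


p_values = {key: two_sided_p(z) for key, z in z_values.items()}
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(sorted(rejected))
```

Because the procedure compares the k-th smallest p-value with k × 0.05 / m, statistics that would be significant at an unadjusted 5% level (for example |z| = 2.10) need not be flagged.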


Table 8
PPO data: z statistics for univariate cell residuals

Item / cat.      0       1       2       3       4
1             2.30    0.95    0.65   -0.13   -3.61
2             2.77   -1.61   -2.19    3.57   -3.01
3             1.88   -1.11    2.30   -0.36   -4.62
4             1.38    0.13    1.53   -0.53   -2.94
5             0.66   -1.39    4.15   -1.33   -3.17

Notes: We have boldfaced the z statistics significant at an overall 5% level using the Benjamini-Hochberg procedure.


Table 9
PPO data: z statistics for bivariate cell residuals

For each pair of items (i, j), rows give the category of item i and columns give the category of item j.

Items   Cat of i   j: 0     j: 1     j: 2     j: 3     j: 4
1,2     0          3.21    -1.87    -2.24    -0.14     7.87
1,2     1         -0.99     0.19    -0.47    -0.46     3.05
1,2     2         -1.06     1.73     1.15    -1.49    -0.21
1,2     3         -0.41    -2.16    -1.08     5.03    -4.01
1,2     4          1.84     1.50     0.56    -3.01     1.96
1,3     0          2.85    -1.05    -2.13     1.27     4.47
1,3     1          0.11     1.77    -0.80    -0.31    -0.76
1,3     2         -1.19    -0.74     1.88    -0.44    -0.84
1,3     3         -1.76    -0.04    -0.79     1.90    -1.33
1,3     4          2.59    -0.65     1.33    -2.30     0.96
1,4     0          0.87    -0.82    -1.94    -0.03     6.99
1,4     1         -0.17     1.38    -1.47     0.51     0.31
1,4     2         -0.58    -2.78     3.64    -1.08    -0.14
1,4     3         -0.39     1.18    -0.66     2.99    -4.34
1,4     4          1.46     1.67    -0.98    -3.17     3.35
1,5     0          1.98    -1.23    -1.61     0.45     5.03
1,5     1          0.34    -0.49     1.13    -0.44    -0.56
1,5     2         -0.72    -0.09     1.24    -0.49    -0.45
1,5     3         -1.00    -0.10    -0.54     2.43    -2.20
1,5     4          0.67     0.79     0.48    -2.90     1.61
2,3     0          2.47    -0.81    -0.40    -0.06     2.70
2,3     1          0.22     0.26    -0.16    -0.26    -1.08
2,3     2         -2.19     1.26    -0.23    -0.96    -0.35
2,3     3         -0.22    -3.23     0.25     3.47    -0.60
2,3     4          3.02     3.47     1.15    -3.47     0.28
2,4     0          1.45     0.53    -0.40     0.13    -0.45
2,4     1         -1.20     1.02    -0.27    -1.05     0.84
2,4     2          0.17    -3.02     3.58    -2.18     0.16
2,4     3         -0.47     1.29    -2.96     5.01    -2.90
2,4     4          1.73     1.73     0.57    -3.63     2.71
2,5     0          2.54    -0.68    -1.31     0.79    -0.30
2,5     1          0.57    -0.95     0.06     0.21     0.13
2,5     2         -0.83    -0.04    -1.02     0.26     1.03
2,5     3         -1.61    -1.56     3.26     1.35    -2.19
2,5     4          2.07     3.30    -1.49    -2.59     1.19
3,4     0          1.88    -0.78    -2.01     0.30     4.51
3,4     1         -0.61    -0.58     0.02    -0.70     1.94
3,4     2         -0.79    -1.36     1.52    -0.93     1.90
3,4     3         -0.07     1.96    -0.52     3.55    -5.96
3,4     4          2.34     1.07    -0.28    -3.91     3.44
3,5     0          2.18    -1.33    -0.74    -0.98     4.36
3,5     1         -0.36    -0.06     0.03    -0.24     0.38
3,5     2         -0.36    -0.93    -0.33     3.16    -1.40
3,5     3         -0.20     0.97     2.83    -2.13    -2.81
3,5     4          0.40    -0.10    -2.28    -1.38     3.65
4,5     0          1.90    -2.19    -1.26     2.11     2.60
4,5     1          0.79    -0.88    -1.42     1.81     1.40
4,5     2         -0.82     0.43     0.33     0.46    -0.71
4,5     3         -1.17     0.58     1.81    -0.65    -2.38
4,5     4          1.78    -0.75    -0.00    -1.65     1.90

Note: We have boldfaced the z statistics significant at an overall 5% level using the Benjamini-Hochberg procedure.


Table 10
PPO data: Bivariate M_ij and X_ij^2 statistics

Item        1         2        3        4        5   Average
1           0    101.06    42.29    84.04    39.97    53.47
2      108.50         0    44.92    40.76    31.68    43.68
3       49.30     48.21        0    67.01    50.23    40.89
4       87.07     42.79    71.29        0    32.61    44.88
5       43.27     32.79    50.88    34.36        0    30.90

Notes: M_ij statistics above the diagonal, X_ij^2 statistics below the diagonal; df = 14. We have boldfaced the M_ij statistics significant at an overall 5% level using the Benjamini-Hochberg procedure. All of them are statistically significant at this level. The average column is the average across the 5 M_ij statistics for the item.
