Behaviormetrika Vol.42, No.2, 2015, 79–94

BAYESIAN ESTIMATION OF A MULTI-UNIDIMENSIONAL GRADED RESPONSE IRT MODEL

Tzu-Chun Kuo∗,∗∗ and Yanyan Sheng∗

Unidimensional graded response models are useful when items are designed to measure a unified latent trait. They are limited in practical instances where the test structure is not readily available or items do not necessarily measure the same underlying trait. To overcome this problem, this paper proposes a multi-unidimensional normal ogive graded response model under the Bayesian framework. The performance of the proposed model was evaluated using Monte Carlo simulations. It was further compared with conventional polytomous models under simulated and real test situations. The results suggest that the proposed multi-unidimensional model is more general and flexible, and offers a better way to represent test situations not realized in unidimensional models.

Key Words and Phrases: item response theory, polytomous response model, unidimensional model, multi-unidimensional model, Markov chain Monte Carlo, Bayesian model choice, Hastings-within-Gibbs

∗ Department of Educational Psychology and Special Education, Southern Illinois University Carbondale, Carbondale, IL 62901-4618, USA
∗∗ Please address correspondence to Tzu-Chun Kuo: tckuo@siu.edu

1. Introduction

Item response theory (IRT; Lord, 1980) models the probabilistic relationship between a person's latent trait and the test at the item level. It has gained increasing popularity in large-scale educational and psychological testing. Dichotomous IRT models (Birnbaum, 1969; Lord, 1980; Lord & Novick, 1968; Rasch, 1960) apply to cognitive/achievement data where correct/incorrect responses are modeled, and such models have been studied extensively in the literature (e.g., Kang & Cohen, 2007; Rizopoulos, 2006). In both cognitive and affective tests, however, items are commonly designed to involve more than two response categories, for which polytomous models are more applicable. Polytomous responses can be nominal or ordinal: the former have no natural ordering between categories, whereas the latter consist of a number of ordered categories. Ordinal polytomous responses, such as Likert scale items (Likert, 1932), are broadly used in disciplines such as education, psychology, and marketing, to name a few. Accordingly, various IRT models have been developed to analyze ordinal polytomous items, such as the graded response model (GRM; Samejima, 1969), the rating scale model (RSM; Andrich, 1978), the partial credit model (PCM; Masters, 1982), and the generalized partial credit model (GPCM; Muraki, 1992). This study focuses on the GRM, as it is the most widely used IRT model for polytomous response data (e.g., Forero & Maydeu-Olivares, 2009; Rubio et al., 2007).

In some circumstances, it suffices to assume that all the test items measure one trait in common and hence to use unidimensional IRT models. However, in other situations, where it is a priori clear that multiple abilities are being measured or the test dimensionality structure is not clear, multidimensional IRT (MIRT; Reckase, 2009) models have to be considered.


MIRT models were developed for situations where distinct multiple traits are involved in producing the manifest responses to an item. A special case of the MIRT model applies to instruments that consist of several subscales, each measuring a different latent trait, such as the Minnesota Multiphasic Personality Inventory (MMPI; Buchanan, 1994). In the IRT literature, such a model has been termed the multi-unidimensional model (Sheng & Wikle, 2007), and it is the major focus of this study.

In IRT, simultaneous estimation of item and person parameters calls for fully Bayesian estimation via Markov chain Monte Carlo (MCMC) simulation techniques. MCMC methods are extremely general and flexible and have proved useful in practically all aspects of Bayesian inference, such as parameter estimation and model comparison. Bayesian procedures have been developed for unidimensional dichotomous (Albert, 1992; Patz & Junker, 1999a,b; Sahu, 2002), multidimensional dichotomous (Beguin & Glas, 2001; Lee, 1995; Sheng & Wikle, 2007, 2008, 2009; Sheng & Headrick, 2012; Yao & Boughton, 2007), and unidimensional GRM (Albert & Chib, 1993; Fox, 2010; Muraki, 1990; Zhu & Stone, 2011) models. Specifically, Sheng and Wikle (2007) described a Bayesian estimation of the multi-unidimensional IRT model for dichotomous items, where a Gibbs sampler was implemented to simultaneously estimate person/item parameters and intertrait correlations. In addition, Albert and Chib (1993) proposed a Gibbs sampler for the unidimensional GRM, and Cowles (1996) (see also Fox, 2010) extended it to include a Metropolis-Hastings (M-H; Hastings, 1970; Metropolis & Ulam, 1949) step. To date, fully Bayesian estimation for the multidimensional logistic GRM is available via the computer program BMIRT (Yao, 2003). However, the M-H algorithm adopted by the program is not ideal in that (1) it requires an adequate proposal distribution for each model parameter, and (2) it fails to directly model the intertrait correlation, which is of interest in many test situations involving multiple latent traits.

In view of the above, this study focuses on the IRT model that is a multi-unidimensional extension of Samejima's GRM and proposes a Bayesian estimation procedure for the model based on the algorithms used by Fox (2010) and Sheng and Wikle (2007). The proposed algorithm can simultaneously estimate item/person parameters as well as intertrait correlations. Its performance is investigated using Monte Carlo simulation studies, and the proposed model is further compared with other conventional models under simulated and real test situations.

2. Models and Procedure

2.1 Models

The unidimensional normal ogive graded response model (NOGRM) provides the simplest framework for modeling the person-item interaction for polytomous response data by assuming one latent dimension.


Suppose an instrument consists of K multiple-response items (e.g., Likert-type items), each measuring a single unified latent trait, θ. With a probit link, the probability that the ith (i = 1, 2, . . . , n) person obtains a response in category c on the jth (j = 1, 2, . . . , K) item is defined as

P(Y_{ij} = c \mid \theta_i, \alpha_j, \boldsymbol{\delta}_j) = \Phi(\alpha_j\theta_i - \delta_{j,c-1}) - \Phi(\alpha_j\theta_i - \delta_{j,c}) = \int_{\delta_{j,c-1}}^{\delta_{j,c}} \phi(z;\, \alpha_j\theta_i)\, dz,    (2.1.1)

where Φ(·) denotes the standard normal CDF, φ(·; μ) denotes the normal density with mean μ and unit variance, αj denotes the item discrimination parameter, and δj,c (c = 1, 2, . . . , Cj) denotes the upper threshold for category c of item j, which has Cj response categories (Samejima, 1969). The thresholds satisfy

-\infty = \delta_{j,0} < \delta_{j,1} < \cdots < \delta_{j,C_j-1} < \delta_{j,C_j} = \infty.    (2.1.2)
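To make Equation (2.1.1) concrete, the following minimal Python sketch (our own illustration, not code from the paper; function and variable names are assumptions) evaluates the category probabilities of a single graded item for a set of persons.

```python
import numpy as np
from scipy.stats import norm

def nogrm_category_probs(theta, alpha, thresholds):
    """Category probabilities under the unidimensional NOGRM, Eq. (2.1.1).

    theta      : latent trait value(s) theta_i
    alpha      : discrimination alpha_j of item j
    thresholds : ordered upper thresholds (delta_{j,1}, ..., delta_{j,C_j-1})
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    # Augment with delta_{j,0} = -inf and delta_{j,C_j} = +inf, Eq. (2.1.2).
    cuts = np.concatenate(([-np.inf], np.asarray(thresholds, dtype=float), [np.inf]))
    # P(Y_ij = c) = Phi(alpha_j*theta_i - delta_{j,c-1}) - Phi(alpha_j*theta_i - delta_{j,c})
    upper = norm.cdf(alpha * theta[:, None] - cuts[:-1])
    lower = norm.cdf(alpha * theta[:, None] - cuts[1:])
    return upper - lower          # shape (n_persons, C_j); each row sums to 1

# Example: a three-category item (C_j = 3) with two thresholds.
print(nogrm_category_probs(theta=[-1.0, 0.0, 1.5], alpha=1.2,
                           thresholds=[-0.5, 0.8]).round(3))
```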

The proposed multi-unidimensional NOGRM is an extension of the unidimensional NOGRM to situations where a K-item instrument consists of m subscales, each containing kv polytomous items that measure one latent dimension. The probability of person i obtaining a response in category c for item j of the vth (v = 1, 2, . . . , m) subscale is defined as

P(Y_{vij} = c \mid \theta_{vi}, \alpha_{vj}, \boldsymbol{\delta}_j) = \Phi(\alpha_{vj}\theta_{vi} - \delta_{j,c-1}) - \Phi(\alpha_{vj}\theta_{vi} - \delta_{j,c}) = \int_{\delta_{j,c-1}}^{\delta_{j,c}} \phi(z;\, \alpha_{vj}\theta_{vi})\, dz,    (2.1.3)

where αvj and θvi denote the item discrimination and the person's latent trait in the vth dimension, and δj is as defined in (2.1.1). The model is illustrated graphically in Figure 1, where items in each subscale measure a single latent trait (θ1 or θ2), but the test itself is multidimensional.

Figure 1: Graphical illustration of the multi-unidimensional GRM model.

2.2 MCMC algorithm

To implement MCMC for the proposed multi-unidimensional NOGRM as defined in Equation (2.1.3), an augmented continuous variable Z was introduced so that Zvij ∼ N(αvj θvi, 1) (Albert, 1992; Albert & Chib, 1993; Lee, 1995). A multivariate normal prior distribution was considered for the latent traits θi = (θ1i, θ2i, . . . , θmi)′, so that θi ∼ Nm(0, P), where P is a correlation matrix with 1s on the diagonal and the correlations ρst between θsi and θti (s ≠ t) on the off-diagonals. Note that when ρst = 1 for all s, t, the model reduces to the unidimensional NOGRM as defined in Equation (2.1.1); when ρst = 0 for all s, t, the model is equivalent to fitting the unidimensional NOGRM separately to each subscale of the test. With the constraint imposed on P, the proper multivariate normal prior for θvi, with location and scale parameters specified to be 0 and 1, ensures unique scaling and hence is essential in resolving a particular identification problem for the model (see Lee, 1995; Sheng, 2008, for a detailed derivation and illustration of the procedure). To carry out the Gibbs sampler, an unconstrained covariance matrix Σ = [σij]m×m was introduced, so that the correlation matrix P can be obtained from Σ using

\rho_{st} = \frac{\sigma_{st}}{\sqrt{\sigma_{ss}\,\sigma_{tt}}}, \qquad s \neq t    (2.2.1)

(Lee, 1995).
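The transformation in (2.2.1) is simply the usual rescaling of a covariance matrix to a correlation matrix. A minimal sketch (our own illustration; names are assumptions) is:

```python
import numpy as np

def cov_to_corr(Sigma):
    """Eq. (2.2.1): rho_st = sigma_st / sqrt(sigma_ss * sigma_tt)."""
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)

# Example: draw latent traits theta_i ~ N_m(0, P) for m = 2 subscales.
rng = np.random.default_rng(1)
Sigma = np.array([[1.3, 0.6],
                  [0.6, 0.9]])
P = cov_to_corr(Sigma)                      # unit diagonal, rho_12 off-diagonal
theta = rng.multivariate_normal(np.zeros(2), P, size=1000)
print(P.round(3))
print(np.corrcoef(theta.T).round(3))        # close to P for large samples
```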

Hence, with prior distributions for α, δ, and Σ, the joint posterior distribution of (θ, α, Z, Σ, δ) is

p(\boldsymbol{\theta}, \boldsymbol{\alpha}, \mathbf{Z}, \boldsymbol{\Sigma}, \boldsymbol{\delta} \mid \mathbf{y}) \propto f(\mathbf{y} \mid \mathbf{Z})\, p(\mathbf{Z} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}, \boldsymbol{\delta})\, p(\boldsymbol{\alpha})\, p(\boldsymbol{\delta})\, p(\boldsymbol{\theta} \mid \mathbf{P})\, p(\boldsymbol{\Sigma}),    (2.2.2)

with the likelihood function being

f(\mathbf{y} \mid \mathbf{Z}) = \prod_{v=1}^{m} \prod_{i=1}^{n} \prod_{j=1}^{k_v} \prod_{c=1}^{C_j} P(Y_{vij} = c)^{I[y_{vij} = c]},    (2.2.3)

where I[X ∈ A] is the indicator function, equal to 1 if X is contained in A and 0 otherwise, and P(Yvij = c) is the probability function in (2.1.3).

It is noted that in developing the MCMC algorithm for the unidimensional GRM, Cowles (1996) suggested an M-H step for generating the threshold parameters δj within the Gibbs sampler proposed by Albert and Chib (1993), in order to accelerate the convergence of the Markov chain. This M-H step requires a suitable proposal density that generates a new candidate δ*j,c:

\delta^{*}_{j,c} \sim N\bigl(\delta^{(l)}_{j,c}, \sigma^{2}_{mh}\bigr)\, I\bigl(\delta^{*}_{j,c-1} < \delta^{*}_{j,c} < \delta^{(l)}_{j,c+1}\bigr)    (2.2.4)

(see also Fox, 2010), where δ^(l)_{j,c} is the value of δj,c in the lth iteration of the sampler. This Hastings-within-Gibbs approach extends to the multi-unidimensional model, so that the MCMC algorithm updates each model parameter as follows:

1. Sample the augmented parameters Z from

Z_{vij} \mid \cdot \;\sim\;
\begin{cases}
N_{(-\infty,\, \delta_{j,1})}(\alpha_{vj}\theta_{vi}, 1) & \text{if } Y_{vij} = 1, \\
\quad\vdots \\
N_{(\delta_{j,c-1},\, \delta_{j,c})}(\alpha_{vj}\theta_{vi}, 1) & \text{if } Y_{vij} = c, \\
\quad\vdots \\
N_{(\delta_{j,C_j-1},\, \infty)}(\alpha_{vj}\theta_{vi}, 1) & \text{if } Y_{vij} = C_j.
\end{cases}    (2.2.5)

2. Sample the latent trait parameters θi from

\boldsymbol{\theta}_i \mid \cdot \;\sim\; N_m\bigl((\mathbf{A}'\mathbf{A} + \mathbf{P}^{-1})^{-1}\mathbf{A}'\mathbf{Z}_i,\; (\mathbf{A}'\mathbf{A} + \mathbf{P}^{-1})^{-1}\bigr),    (2.2.6)

where

\mathbf{A} =
\begin{bmatrix}
\boldsymbol{\alpha}_1 & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \boldsymbol{\alpha}_2 & \cdots & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \boldsymbol{\alpha}_m
\end{bmatrix}_{K \times m},    (2.2.7)

\mathbf{Z}_i =
\begin{bmatrix}
\mathbf{Z}_{1i} \\
\vdots \\
\mathbf{Z}_{mi}
\end{bmatrix}.    (2.2.8)

3. Draw candidates δ*j from the proposal density in (2.2.4). Sample Uj ∼ U(0, 1), and set δ^(l+1)_{j,c} = δ*_{j,c} for c = 1, 2, . . . , Cj − 1 when

U_j \le \min\Biggl\{
\prod_i \frac{\Phi(\alpha_{vj}\theta_{vi} - \delta^{*}_{j,y_{vij}-1}) - \Phi(\alpha_{vj}\theta_{vi} - \delta^{*}_{j,y_{vij}})}
            {\Phi(\alpha_{vj}\theta_{vi} - \delta_{j,y_{vij}-1}) - \Phi(\alpha_{vj}\theta_{vi} - \delta_{j,y_{vij}})}
\times
\prod_{c=1}^{C_j-1} \frac{\Phi\bigl(\tfrac{\delta_{j,c+1} - \delta_{j,c}}{\sigma_{mh}}\bigr) - \Phi\bigl(\tfrac{\delta^{*}_{j,c-1} - \delta_{j,c}}{\sigma_{mh}}\bigr)}
                        {\Phi\bigl(\tfrac{\delta^{*}_{j,c+1} - \delta^{*}_{j,c}}{\sigma_{mh}}\bigr) - \Phi\bigl(\tfrac{\delta_{j,c-1} - \delta^{*}_{j,c}}{\sigma_{mh}}\bigr)},
\; 1 \Biggr\},    (2.2.9)

where thresholds without an asterisk denote the current values from the lth iteration.

4. Sample the item discrimination parameters αvj from

\alpha_{vj} \mid \cdot \;\sim\; N_{(0,\infty)}\bigl(\mu_{\alpha_{vj}}, \sigma^{2}_{\alpha_{vj}}\bigr),    (2.2.10)

where σ²_{αvj} = (θ′v θv)⁻¹ and μ_{αvj} = σ²_{αvj} θ′v Zvj, assuming a noninformative uniform prior αvj > 0; or σ²_{αvj} = (θ′v θv + 1/σ²_α)⁻¹ and μ_{αvj} = σ²_{αvj}(θ′v Zvj + μ_α/σ²_α), assuming a conjugate normal prior αvj ∼ N_{(0,∞)}(μ_α, σ²_α).

5. Sample the unconstrained covariance matrix Σ from

\boldsymbol{\Sigma} \mid \cdot \;\sim\; W^{-1}(\mathbf{S}^{-1}, n),    (2.2.11)

where W⁻¹ denotes an inverse Wishart distribution and S = Σⁿ_{i=1} (Cθi)(Cθi)′, in which

\mathbf{C} =
\begin{bmatrix}
\bigl(\prod_{j=1}^{k_1}\alpha_{1j}\bigr)^{1/k_1} & 0 & \cdots & 0 \\
0 & \bigl(\prod_{j=1}^{k_2}\alpha_{2j}\bigr)^{1/k_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \bigl(\prod_{j=1}^{k_m}\alpha_{mj}\bigr)^{1/k_m}
\end{bmatrix}.    (2.2.12)

6. Transform Σ to P using (2.2.1).
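The sketch below (Python with NumPy/SciPy) illustrates how steps 1–3 above might be coded for a single item or person; it is our own schematic illustration rather than the authors' implementation, all function and variable names are assumptions, and the looping over items/persons as well as the updates in steps 4–6 are only indicated in comments.

```python
import numpy as np
from scipy.stats import norm, truncnorm

rng = np.random.default_rng(0)

def draw_Z(y, theta_v, alpha_vj, delta):
    """Step 1, Eq. (2.2.5): truncated-normal draws of the augmented data for one
    item; y holds categories 1..C and delta = (-inf, d_1, ..., d_{C-1}, +inf)."""
    y, delta = np.asarray(y), np.asarray(delta, dtype=float)
    mean = alpha_vj * theta_v
    lo, hi = delta[y - 1], delta[y]            # truncation interval per person
    return truncnorm.rvs(lo - mean, hi - mean, loc=mean, scale=1.0, random_state=rng)

def draw_theta(Z_i, A, P_inv):
    """Step 2, Eq. (2.2.6): theta_i | . ~ N_m((A'A + P^-1)^-1 A'Z_i, (A'A + P^-1)^-1)."""
    V = np.linalg.inv(A.T @ A + P_inv)
    return rng.multivariate_normal(V @ A.T @ Z_i, V)

def mh_threshold_step(delta, y, theta_v, alpha_vj, sigma_mh=0.05):
    """Step 3, Eqs. (2.2.4) and (2.2.9): M-H update of the free thresholds of one item."""
    y, delta = np.asarray(y), np.asarray(delta, dtype=float)
    C = len(delta) - 1
    cand = delta.copy()
    for c in range(1, C):                      # candidates drawn left to right, Eq. (2.2.4)
        a = (cand[c - 1] - delta[c]) / sigma_mh
        b = (delta[c + 1] - delta[c]) / sigma_mh
        cand[c] = truncnorm.rvs(a, b, loc=delta[c], scale=sigma_mh, random_state=rng)
    eta = alpha_vj * theta_v
    lik_ratio = np.prod((norm.cdf(eta - cand[y - 1]) - norm.cdf(eta - cand[y])) /
                        (norm.cdf(eta - delta[y - 1]) - norm.cdf(eta - delta[y])))
    prop_ratio = np.prod([
        (norm.cdf((delta[c + 1] - delta[c]) / sigma_mh) -
         norm.cdf((cand[c - 1] - delta[c]) / sigma_mh)) /
        (norm.cdf((cand[c + 1] - cand[c]) / sigma_mh) -
         norm.cdf((delta[c - 1] - cand[c]) / sigma_mh))
        for c in range(1, C)])
    accept = rng.uniform() <= min(lik_ratio * prop_ratio, 1.0)     # Eq. (2.2.9)
    return cand if accept else delta

# Steps 4-6 (Eqs. 2.2.10-2.2.12) would then draw each alpha_vj from a positive
# truncated normal, draw Sigma from an inverse Wishart based on the rescaled
# theta draws, and convert Sigma to the correlation matrix P via Eq. (2.2.1).
```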

2.3 Bayesian model choice

In the Bayesian framework, the adequacy of a model for describing the data is evaluated using model choice or model checking techniques. This research focuses on the Bayesian deviance information criterion (DIC; Spiegelhalter et al., 2002) for evaluating the fit of a given model. DIC is based on the posterior distribution of the deviance and is defined as DIC = D̄ + pD, where D̄ = E(−2 log L(y|Λ)) is the posterior expectation of the deviance (with L(y|Λ) being the model likelihood function and Λ denoting all model parameters) and pD = D̄ − D(Λ̄) is the effective number of parameters (Carlin & Louis, 2000). Here D(Λ̄) = −2 log L(y|Λ̄), where Λ̄ is the posterior mean. From the Hastings-within-Gibbs procedure, posterior samples of the parameters, Λ^(1), Λ^(2), . . . , Λ^(G), are drawn, and D̄ can be approximated as D̄ ≈ Σ^G_{g=1} (−2 log L(y|Λ^(g)))/G. Small values of D̄ suggest a better model fit. Generally, more complicated models tend to provide a better fit; penalizing for the effective number of parameters (pD) therefore makes DIC a more reasonable measure to use.
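As a quick illustration of how DIC can be computed from MCMC output, the following sketch (our own, with assumed names) takes the log-likelihood evaluated at each posterior draw, for example via the category probabilities of Equation (2.1.3), together with the log-likelihood at the posterior means.

```python
import numpy as np

def dic(loglik_draws, loglik_at_posterior_mean):
    """DIC = Dbar + pD, with Dbar = E[-2 log L] approximated from the posterior
    draws and pD = Dbar - D(Lambda_bar) (Spiegelhalter et al., 2002)."""
    D_bar = np.mean(-2.0 * np.asarray(loglik_draws, dtype=float))
    D_hat = -2.0 * loglik_at_posterior_mean     # deviance at the posterior means
    p_D = D_bar - D_hat                         # effective number of parameters
    return D_bar + p_D, p_D

# Example with made-up numbers (three posterior draws only, for illustration):
dic_value, p_D = dic([-13690.2, -13688.4, -13685.0], -13600.0)
print(round(dic_value, 1), round(p_D, 1))
```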

3. Simulation Studies

To demonstrate the proposed methodology, Monte Carlo simulations were conducted to investigate item parameter recovery and model comparison.

3.1 Parameter recovery

For parameter recovery, tests with two subscales were considered, so that the first half of the items measured one latent trait (θ1) and the second half measured the other (θ2). Three factors were manipulated: sample size (n), test length (K), and intertrait correlation (ρ12). Specifically, n polytomous response vectors (n = 500, 1000) to K items (K = 20, 30) were generated according to the multi-unidimensional NOGRM as defined in Equation (2.1.3), where the population correlation between the two latent traits was set to ρ12 = 0.2, 0.5, or 0.8.
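To make the data-generating design concrete, the sketch below simulates responses from the two-subscale model of Equation (2.1.3); it is our own illustrative code (names are assumptions), with item discriminations and thresholds drawn as described later in this subsection (αvj ∼ U(0, 2), thresholds as sorted standard-normal draws).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2015)

def simulate_multiunidim_grm(n=500, K=20, rho12=0.5, C=3):
    """Simulate responses from a two-subscale multi-unidimensional NOGRM:
    the first K/2 items measure theta_1 and the second K/2 measure theta_2."""
    P = np.array([[1.0, rho12], [rho12, 1.0]])
    theta = rng.multivariate_normal(np.zeros(2), P, size=n)   # n x 2 latent traits
    subscale = np.repeat([0, 1], K // 2)                      # item -> dimension map
    alpha = rng.uniform(0.0, 2.0, size=K)                     # alpha_vj ~ U(0, 2)
    delta = np.sort(rng.standard_normal((K, C - 1)), axis=1)  # ordered thresholds
    cuts = np.hstack([np.full((K, 1), -np.inf), delta,        # add delta_0, delta_C
                      np.full((K, 1), np.inf)])
    eta = theta[:, subscale] * alpha                          # alpha_vj * theta_vi, n x K
    # Category probabilities from Eq. (2.1.3), then one draw per person-item pair.
    probs = (norm.cdf(eta[:, :, None] - cuts[None, :, :-1]) -
             norm.cdf(eta[:, :, None] - cuts[None, :, 1:]))
    u = rng.uniform(size=(n, K, 1))
    y = 1 + (u > np.cumsum(probs, axis=2)).sum(axis=2)        # categories 1..C
    return y, theta, alpha, delta

y, theta, alpha, delta = simulate_multiunidim_grm(n=1000, K=20, rho12=0.8)
print(y.shape, np.bincount(y.ravel())[1:])                    # response frequencies
```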

Table 1: Average RMSD (Bias) for recovering α1 in the multi-unidimensional NOGRM.

              Noninformative priors                 Informative priors
ρ12    K      n = 500          n = 1000            n = 500          n = 1000
0.2    20     0.1315 (0.0667)  0.0812 (0.0301)     0.1121 (0.0309)  0.0751 (0.0204)
0.2    30     0.2189 (0.1285)  0.1016 (0.0379)     0.1597 (0.0577)  0.0889 (0.0135)
0.5    20     0.1297 (0.0655)  0.0814 (0.0118)     0.1162 (0.0322)  0.0660 (0.01586)
0.5    30     0.2150 (0.1252)  0.1007 (0.0390)     0.1546 (0.0537)  0.0885 (0.0153)
0.8    20     0.1300 (0.0673)  0.0933 (0.0342)     0.1128 (0.0320)  0.0758 (−0.0064)
0.8    30     0.1981 (0.1149)  0.0988 (0.0395)     0.1557 (0.0559)  0.0873 (0.0152)

It is noted that in the GRM the number of response categories, Cj (j = 1, . . . , K), may differ across items; in practice, however, items commonly share the same number of categories. This simulation study therefore considered situations where all items were measured on a three-point Likert scale (i.e., Cj = 3 for j = 1, . . . , K), so that two threshold parameters were estimated for each item. The item discrimination parameters αv were generated randomly from uniform distributions, αvj ∼ U(0, 2). The threshold parameters δj1 and δj2 were the sorted values of two draws from a standard normal distribution, i.e., δj1 = min(X1, X2) and δj2 = max(X1, X2), where X1, X2 ∼ N(0, 1).

The Hastings-within-Gibbs algorithm was subsequently implemented to recover the model parameters assuming noninformative or informative prior distributions for αv. The variance of the proposal density in Equation (2.2.4), σ²mh, was specified so that the average acceptance rate was close to 0.5 (cf. Fox, 2010). Each Markov chain had a run length of 10,000 iterations with a burn-in period of 2,000 iterations. Convergence was visually evaluated using running-mean plots and trace plots. Ten replications were carried out for each scenario, and root-mean-squared differences (RMSD) and bias (Bias) were used to evaluate the recovery of each item parameter. These values were further averaged across items and are summarized in Tables 1 to 5.

Tables 1 and 2 display the results for recovering the item discrimination parameters of the two subscales, which suggest that different intertrait correlations (ρ12) result in similar RMSD and Bias values. Sample size and test length have different effects on the item parameter estimates. Specifically, an increased sample size (n) tends to reduce the average RMSD with both noninformative and informative priors. For example, for estimating α2 (ρ12 = 0.2, K = 20), the RMSD decreased from 0.1565 to 0.0711 when n increased from 500 to 1000 with noninformative priors (from 0.0881 to 0.0753 with informative priors). In terms of the average Bias, most of the values are positive, indicating that the proposed procedure tends to overestimate the item discrimination parameters, α.

Table 2: Average RMSD (Bias) for recovering α2 in the multi-unidimensional NOGRM.

              Noninformative priors                 Informative priors
ρ12    K      n = 500          n = 1000            n = 500          n = 1000
0.2    20     0.1565 (0.0728)  0.0711 (0.0073)     0.0881 (0.0170)  0.0757 (0.0012)
0.2    30     0.1139 (0.0457)  0.0772 (0.0287)     0.0944 (0.0051)  0.0675 (0.0089)
0.5    20     0.1327 (0.0558)  0.1173 (0.0417)     0.2133 (0.0327)  0.0883 (0.0072)
0.5    30     0.1296 (0.0645)  0.1137 (0.0569)     0.1000 (0.0182)  0.0942 (0.0337)
0.8    20     0.1148 (0.0489)  0.1077 (0.0583)     0.1196 (0.0076)  0.0790 (0.0088)
0.8    30     0.1498 (0.0857)  0.0857 (0.0458)     0.1116 (0.0374)  0.0708 (0.0246)

Table 3: Average RMSD (Bias) for recovering δ1 in the multi-unidimensional NOGRM.

              Noninformative priors                   Informative priors
ρ12    K      n = 500           n = 1000             n = 500           n = 1000
0.2    20     0.1263 (−0.0509)  0.0777 (−0.0223)     0.1112 (−0.0455)  0.0682 (−0.0041)
0.2    30     0.2495 (−0.1555)  0.1475 (−0.1021)     0.2148 (−0.1344)  0.1316 (−0.0966)
0.5    20     0.1353 (−0.0467)  0.1111 (−0.0366)     0.1689 (−0.0684)  0.0860 (−0.0368)
0.5    30     0.2741 (−0.1791)  0.1568 (−0.1071)     0.2257 (−0.1484)  0.1508 (−0.1039)
0.8    20     0.1658 (−0.0815)  0.1317 (−0.0693)     0.1617 (−0.0768)  0.1226 (−0.0377)
0.8    30     0.2965 (−0.1991)  0.1513 (−0.1155)     0.2740 (−0.1848)  0.1436 (−0.1096)

On the other hand, RMSD and Bias increase when more items are included. For example, for estimating α1 (ρ12 = 0.5, n = 500), the average RMSD increased from 0.1297 to 0.2150 when K increased from 20 to 30 with noninformative priors (from 0.1162 to 0.1546 with informative priors). However, such an increase in RMSD and Bias was smaller with a larger sample size (n = 1,000) than with a smaller one (n = 500). In addition, the choice of noninformative or informative priors for α does not result in much different parameter estimates.

With respect to recovering δ1 and δ2, Tables 3 and 4 indicate that increasing n and decreasing K result in a smaller RMSD and a Bias closer to zero. Specifically, for estimating δ2 (ρ12 = 0.5, n = 500), the average RMSD increased from 0.1117 to 0.2621 when K increased from 20 to 30 with noninformative priors (from 0.1567 to 0.2178 with informative priors). In addition, noninformative and informative priors again yield similar results. However, the Bias for recovering δ1 and δ2 is consistently negative, indicating that the proposed procedure underestimates the threshold parameters.

Table 4: Average RMSD (Bias) for recovering δ2 in the multi-unidimensional NOGRM.

              Noninformative priors                   Informative priors
ρ12    K      n = 500           n = 1000             n = 500           n = 1000
0.2    20     0.1141 (−0.0392)  0.0611 (−0.0111)     0.0999 (−0.0338)  0.0634 (−0.0082)
0.2    30     0.2401 (−0.1445)  0.1264 (−0.0883)     0.2095 (−0.1276)  0.1212 (−0.0858)
0.5    20     0.1117 (−0.0368)  0.0988 (−0.0303)     0.1567 (−0.0651)  0.0739 (−0.0341)
0.5    30     0.2621 (−0.1636)  0.1446 (−0.0899)     0.2178 (−0.1374)  0.1405 (−0.0896)
0.8    20     0.1536 (−0.0657)  0.1218 (−0.0557)     0.1575 (−0.0630)  0.0891 (−0.0273)
0.8    30     0.2842 (−0.1853)  0.1412 (−0.0984)     0.2667 (−0.1765)  0.1365 (−0.0962)

Table 5: Average RMSD (Bias) for recovering ρ12 in the multi-unidimensional NOGRM.

              Noninformative priors                   Informative priors
ρ12    K      n = 500           n = 1000             n = 500           n = 1000
0.2    20     0.0329 (−0.0001)  0.0232 (0.0142)      0.0540 (0.0200)   0.0286 (0.0155)
0.2    30     0.0014 (0.0006)   0.0232 (0.0081)      0.0382 (−0.0005)  0.0241 (0.0089)
0.5    20     0.0268 (0.0061)   0.0253 (−0.0063)     0.0369 (−0.0005)  0.0173 (0.0155)
0.5    30     0.0297 (0.0090)   0.0155 (0.0085)      0.0322 (0.0109)   0.0173 (0.0115)
0.8    20     0.0200 (−0.0057)  0.0098 (0.0011)      0.0232 (0.0055)   0.0110 (−0.0020)
0.8    30     0.0214 (−0.0048)  0.0119 (−0.0034)     0.0214 (0.0052)   0.0116 (0.0016)

Moreover, Table 5 summarizes the simulation results for recovering the intertrait correlation, ρ12, where larger sample sizes and/or test lengths result in smaller RMSD values. Specifically, when ρ12 = 0.5, the RMSD decreased from 0.0297 to 0.0155 when n increased from 500 to 1000 with noninformative priors (from 0.0322 to 0.0173 with informative priors). It is noted that the effect of test length on estimating ρ12 differs from its effect on estimating α or δ. In addition, a close comparison of the RMSD values in Tables 1 to 5 shows that the RMSDs for estimating ρ12 are much smaller than those for α or δ, indicating that the proposed method estimates the intertrait correlation with greater precision. This is further supported by the Bias values reported in Table 5, which are consistently closer to zero than those for recovering α and δ.

In summary, an increased sample size (n) improves the precision and reduces the bias in estimating both the item parameters (α and δ) and the intertrait correlation, which is consistent with findings for other IRT models (e.g., Linacre, 2002; Sheng, 2010).

However, test length has the opposite effect on the item parameters α and δ. This is possibly because longer tests substantially increase the number of item parameters and hence the model complexity. It is noted that we only considered items with three categories; the complexity associated with a longer test increases further when test items involve more than three categories. For example, for a test consisting of five-category items, the item parameter estimates will be less accurate under the same test length and sample size condition. To overcome this, as suggested by the simulation results, one needs larger sample sizes to ensure accuracy in estimating the item parameters.

3.2 Model comparison

To further illustrate the proposed model, it is compared with conventional GRMs using the Bayesian DIC. Specifically, three IRT models were considered, namely, the multi-unidimensional GRM (model 1), the unidimensional GRM (model 2), and a constrained multi-unidimensional GRM with zero correlations among the latent traits, i.e., ρst = 0 (model 3). Models 2 and 3 can be viewed as special cases of model 1 in which the latent traits have either a perfect correlation or a zero correlation. Each of the models was fitted to simulated data for tests with three-category items and two dimensions, where the intertrait correlation (ρ12) was set to 0, 0.5, or 1. Specifically, polytomous responses of n = 1,000 persons to K = 20 items were generated from the multi-unidimensional GRM so that 10 items measured θ1 and the other 10 items measured θ2. Item discrimination and threshold parameters, αv and δ, were generated as in the previous section. The Hastings-within-Gibbs procedure, assuming noninformative priors for αv, was then implemented so that 10,000 samples were simulated with the first 2,000 set as burn-in. Ten replications were conducted, and the posterior expectations of the Gibbs samples were used to obtain the posterior estimates necessary to derive the Bayesian deviance results, which were averaged across the 10 replications. The average Bayesian deviance estimates, including D̄ (the posterior expectation of the deviance), D(Λ̄) (the deviance of the posterior expectation), pD (the effective number of parameters), and DIC, were obtained from each implementation and are displayed in Table 6.

A close examination of the DIC values indicates that the multi-unidimensional model (model 1) consistently performs as well as, if not better than, the other models (models 2 and 3). Hence, according to DIC, the more general multi-unidimensional GRM is flexible and works well in test situations where the latent traits correlate at different levels. On the other hand, the conventional unidimensional GRM (model 2) has a consistently larger DIC and is the least preferred model when the actual intertrait correlation is less than 1. Caution is consequently advised when applying the conventional GRM to tests where the latent structure is not readily available or indicates a non-perfect correlation.

Table 6: Bayesian deviance estimates for multi-unidimensional (ρ12 = 0, ρ12 = 0.5) or unidimensional (ρ12 = 1) data fitted with the three IRT models.

                        D(Λ̄)        D̄           pD        DIC
ρ12 = 0     model 1    25617.62    27369.23    1751.52   29120.54
            model 2    30499.31    31413.15     914.01   32327.31
            model 3     2562.00    27368.38    1751.33   29119.62
ρ12 = 0.5   model 1    25587.77    27276.92    1689.05   28965.85
            model 2    29201.32    30142.15     940.81   31083.00
            model 3    25522.23    27275.32    1753.02   29028.46
ρ12 = 1     model 1    26349.88    27404.25    1054.19   28458.50
            model 2    26554.29    26554.29     962.80   28480.14
            model 3    25603.00    27329.83    1726.98   29056.83

One has to note that in situations where the latent traits were not correlated (ρ12 = 0), the difference between the DIC estimates for models 1 and 3 was rather trivial (less than 1). Thus, one may argue that these two models (both being reasonable) were essentially similar, given that such a small difference may not be of practical significance (Lee, 2007). In addition, in situations where the latent traits were perfectly correlated (ρ12 = 1), the equally reasonable unidimensional model (model 2) was not favored by DIC. This may be caused by the limitation that DIC sometimes underpenalizes and hence favors the more complicated model, i.e., model 1 in this case (Plummer, 2008). It is also interesting to note that the effective number of parameters (pD) for model 1 was similar to that of model 3 when ρ12 = 0, but was much smaller than that of model 3 when ρ12 > 0, and the difference increased as ρ12 approached 1.

4. Empirical Example

To further illustrate, the proposed multi-unidimensional GRM was applied to a subset of the Feminist Perspective Scale (FPS; Henley et al., 1998) data. In real test situations the actual latent structure may not be readily available, so model comparison is needed to determine which model provides a better representation of the data.

4.1 Methodology

The overall FPS as developed by Henley et al. (1998) consists of six attitudinal subscales. For illustrative purposes, we focused on two of them: Socialist Feminism and Cultural Feminism. The data were retrieved from the online Personality Test Website (http://personality-testing.info/), with responses from multiple countries, such as the United States, Australia, and Canada. In the data, each subscale consists of 10 items, and each item is rated on a 5-point Likert-type scale (with 1 labelled "Disagree", 3 "Neutral", and 5 "Agree").

After removing missing responses, we selected a random sample of 1,000 respondents for the subsequent analyses. To describe the polytomous response data, the three candidate models described in the previous section were considered, namely, the proposed multi-unidimensional GRM (model 1), the unidimensional GRM (model 2), and the constrained multi-unidimensional GRM with ρ12 = 0 (model 3). Each of these models was fitted to the FPS data using the Hastings-within-Gibbs procedure with noninformative priors assumed for αv, where 10,000 iterations were obtained and the first 2,000 were set as burn-in. The variance of the proposal density in Equation (2.2.4), σ²mh, was set to 0.025 to obtain an acceptance rate close to 0.5 for each item. Convergence was visually evaluated using running-mean plots and trace plots. The three candidate models were then compared using the Bayesian DIC.

4.2 Results and discussion

Table 7 summarizes the Bayesian deviance results, where smaller values indicate better model fit. Among the three candidate models, model 2 has the largest posterior expectation of the deviance (D̄), while models 1 and 3 have similar D̄ values. Furthermore, model 2, having the smallest effective number of parameters (pD), appears to be the simplest model. After penalizing for model complexity, the proposed multi-unidimensional GRM (model 1) has the smallest DIC, indicating that it is the best choice among the three candidate models. Thus, based on the Bayesian DIC criterion, there is evidence in favor of the more general multi-unidimensional GRM, which in turn suggests the inadequacy of the unidimensional model in this context. It is noted that model 3 has a DIC value close to that of model 1; however, their difference is not trivial. In addition, the effective number of parameters (pD) for model 2 is much smaller than that for model 3. Given the simulation results presented in the previous section, and given that the estimated intertrait correlation (ρ12) in model 1 is 0.7827, we conclude that the actual latent structure for the FPS data is closer to multi-unidimensional, with the two subscales being moderately to highly correlated. Hence, a single unified Feminist Perspective trait is not sufficient for describing the specific trait levels of the Socialist and Cultural Feminist Perspective subscales.

Table 7: Bayesian deviance estimates for the three models with the FPS data.

              D(Λ̄)        D̄           pD        DIC
model 1      48076.00    49776.00    1700.10   51476.00
model 2      50477.00    51511.00    1034.30   52545.00
model 3      47899.00    49747.00    1847.80   51595.00

5. Discussion

In general, this study developed an MCMC algorithm for the proposed multi-unidimensional GRM by extending such models for dichotomous data (Sheng & Wikle, 2007).

Simulation results indicate that the developed algorithm works well in various test situations, including those where the intertrait correlation is low, moderate, or high, and that the results are not sensitive to different prior distributions for α. Similar to findings in other IRT research, increasing the sample size is a key factor affecting the accuracy of item parameter estimation. In addition, the comparison of the proposed model with conventional GRMs suggests that the unidimensional GRM works well only in situations where the latent traits have a perfect correlation, that is, when the subscales of a test measure latent dimensions whose correlations are all 1. For situations where the intertrait correlation varies at different levels (i.e., each subscale measures a distinct dimension), which are more common in practice, the unidimensional model is restrictive and not preferred. The proposed multi-unidimensional GRM, by modeling the intertrait correlation, is theoretically more appealing and offers advantages over the unidimensional model in test situations where the latent traits correlate at different levels, as demonstrated using simulated data. Furthermore, when the latent structure is not a priori clear, the use of the multi-unidimensional GRM is suggested given its flexibility. We demonstrated this using the Feminist Perspective Scale data, where the latent traits measured by the two subscales, Socialist Feminist Perspective and Cultural Feminist Perspective, are supposed to have a moderate to high correlation. This further suggests that even if subscales measure similar traits, the unidimensional model does not describe the data as well as the multi-unidimensional model. Hence the multi-unidimensional model offers a better way to represent test situations when unidimensionality is uncertain.

The current study focused on the graded response model with more than one latent trait. Future studies can consider other polytomous models, such as the RSM or the PCM. Additionally, in many test situations respondents are grouped into larger units, and variables are available to characterize both the respondents and the higher-level units (Fox, 2010). For example, the Feminist Perspective Scale data were collected from respondents in different countries. One may be interested in estimating parameters at both levels, and hence further research can extend the multi-unidimensional model to a multilevel model. Moreover, in this study Bayesian deviance was used to evaluate individual models. Given that DIC may be limited in that it favors more complicated models by not penalizing model complexity sufficiently (Plummer, 2008), further studies can adopt other methods for model comparison, such as the Bayes factor (Carlin & Louis, 2000), posterior predictive model checking (Sinharay & Stern, 2003), or the Bayesian residuals proposed by Albert and Chib (1993).

This study focused on the probit form of the GRM, which makes it possible to derive the full conditional distributions in closed form and consequently to use Gibbs sampling for updating most model parameters. Given the ease and computational efficiency of Gibbs sampling over Metropolis-Hastings, the logistic form was not considered. Future studies can examine how the proposed procedure compares with a Metropolis-Hastings procedure for the logistic GRM.

REFERENCES

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251–269.
Albert, J. H. & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Beguin, A. A. & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–562.
Birnbaum, A. (1969). Statistical theory for logistic mental test models with a prior distribution of ability. Journal of Mathematical Psychology, 6, 258–276.
Buchanan, R. D. (1994). The development of the Minnesota Multiphasic Personality Inventory. Journal of the History of the Behavioral Sciences, 30, 148–161.
Carlin, B. P. & Louis, T. A. (2000). Bayes and empirical Bayes methods for data analysis (2nd ed.). London: Chapman & Hall.
Cowles, M. K. (1996). Accelerating Monte Carlo Markov chain convergence for cumulative-link generalized linear models. Statistics and Computing, 6, 101–111.
Forero, C. G. & Maydeu-Olivares, A. (2009). Estimation of IRT graded response models: Limited versus full information methods. Psychological Methods, 14, 275–299.
Fox, J. P. (2010). Bayesian item response modeling: Theory and applications. New York, NY: Springer-Verlag.
Fu, Z. H., Tao, J., & Shi, N. Z. (2010). Bayesian estimation of the multidimensional graded response model with nonignorable missing data. Journal of Statistical Computation and Simulation, 80, 1237–1252.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Henley, N. M., Meng, K., O'Brien, D., McCarthy, W. J., & Sockloskie, R. J. (1998). Developing a scale to measure the diversity of feminist attitudes. Psychology of Women Quarterly, 22, 317–348.
Hoijtink, H. & Molenaar, I. W. (1997). A multidimensional item response model: Constrained latent class analysis using the Gibbs sampler and posterior predictive checks. Psychometrika, 62, 171–189.
Kang, T. & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31, 331–358.
Kelderman, H. (1996). Multidimensional Rasch models for partial-credit scoring. Applied Psychological Measurement, 20, 155–168.
Lee, H. (1995). Markov chain Monte Carlo methods for estimating multidimensional ability in item response theory (Doctoral dissertation). University of Missouri, Columbia, MO.
Lee, S. Y. (2007). Structural equation modeling: A Bayesian approach. Chichester: John Wiley and Sons.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22, 5–55.
Linacre, J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3, 85–106.
Lord, F. M. (1980). Applications of item response theory to practical testing problems (2nd ed.). New Jersey, NJ: Hillsdale.


Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores (1st ed.). Maryland, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 147–174.
Metropolis, N. & Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44, 335–341.
Molenaar, I. W. (1995). Estimation of item parameters (2nd ed.). New York, NY: Springer-Verlag.
Muraki, E. (1990). Fitting a polytomous item response model to Likert-type data. Applied Psychological Measurement, 14, 59–71.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Patz, R. J. & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.
Patz, R. J. & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.
Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9, 523–539.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests (2nd ed.). Danmark: Danmarks Paedagogiske Institute.
Reckase, M. (2009). Multidimensional item response theory (2nd ed.). New York, NY: Springer-Verlag.
Rizopoulos, D. (2006). ltm: An R package for latent variable modeling and item response theory analyses. Journal of Statistical Software, 17, 1–25.
Rubio, V. J., Aguado, D., Hontangas, P. M., & Hernandez, J. M. (2007). Psychometric properties of an emotional adjustment measure. European Journal of Psychological Assessment, 23, 39–46.
Sahu, S. K. (2002). Bayesian estimation and model choice in item response models. Journal of Statistical Computation and Simulation, 72, 217–232.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.
Sheng, Y. (2008). A MATLAB package for Markov chain Monte Carlo with a multi-unidimensional IRT model. Journal of Statistical Software, 28, 1–20.
Sheng, Y. (2010). A sensitivity analysis of Gibbs sampling for 3PNO IRT models: Effects of prior specifications on parameter estimates. Behaviormetrika, 37, 87–110.
Sheng, Y. & Headrick, T. C. (2012). A Gibbs sampler for the multidimensional item response model. ISRN Applied Mathematics, 2012, 1–14.
Sheng, Y. & Wikle, C. K. (2007). Comparing multiunidimensional and unidimensional item response theory models. Educational and Psychological Measurement, 67, 899–919.
Sheng, Y. & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical structure. Educational and Psychological Measurement, 68, 413–430.
Sheng, Y. & Wikle, C. K. (2009). Bayesian IRT models incorporating general and specific abilities. Behaviormetrika, 36, 27–48.
Sinharay, S. & Stern, H. S. (2003). Posterior predictive model checking in hierarchical models. Journal of Statistical Planning and Inference, 111, 209–221.
Spiegelhalter, D. J., Best, N., Carlin, B., & van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B, 64, 583–640.
Yao, L. (2003). BMIRT: Bayesian multivariate item response theory. Monterey, CA: CTB/McGraw-Hill.


Yao, L. & Schwarz, R. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 30, 469–492.
Yao, L. & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83–105.
Zhu, X. & Stone, C. A. (2011). Assessing fit of unidimensional graded response models using Bayesian methods. Journal of Educational Measurement, 48, 81–97.

(Received January 20, 2015; Revised May 22, 2015)