Behaviormetrika Vol.40, No.1, 2013, 19–40

AN EMPIRICAL INVESTIGATION OF BAYESIAN HIERARCHICAL MODELING WITH UNIDIMENSIONAL IRT MODELS

Yanyan Sheng∗

Assuming specific values for item hyperparameters, Bayesian nonhierarchical modeling for unidimensional IRT models suffers from the problem that it relies on the availability of appropriate prior information for the three-parameter model or for small datasets. This problem can be resolved by specifying priors in a hierarchical fashion, so that the item hyperparameters are unknown and have their own prior distributions. This study investigated the performance of such hierarchical modeling by comparing it with the nonhierarchical approach using Monte Carlo simulations. The results provide empirical evidence for the advantage of using hierarchical priors in modeling unidimensional item response data when appropriate prior information is not readily available and when datasets are not sufficiently large.

1. Introduction

Simultaneous estimation of both item and person parameters in item response theory (IRT) models results in statistical complexities in the estimation task, which have made the estimation procedure a primary focus of psychometric research over the decades (Birnbaum, 1969; Bock & Aitkin, 1981; Molenaar, 1995). With enhanced computational technology and the emergence of Markov chain Monte Carlo (MCMC) simulation techniques (Smith & Roberts, 1993; Tierney, 1994), recent attention has focused on fully Bayesian methodology. MCMC methods have proved useful in practically all aspects of Bayesian inference, such as parameter estimation and model comparison (Carlin & Louis, 2000; Chib & Greenberg, 1995; Gelfand & Smith, 1990; Gelman, Carlin, Stern, & Rubin, 2003). A key reason for the widespread interest in them is that they are extremely general and flexible, and hence can be used to sample univariate and multivariate distributions when other methods (e.g., marginal maximum likelihood) either fail or are difficult to implement. In addition, MCMC allows one to model the dependencies among parameters and sources of uncertainty (Tsutakawa & Johnson, 1990; Tsutakawa & Soltys, 1988). One of the simplest MCMC algorithms is Gibbs sampling (Geman & Geman, 1984). The method is straightforward to implement when each full conditional distribution associated with a particular multivariate posterior density is a known distribution that is easy to sample.
Key Words and Phrases: item response theory, normal ogive models, Gibbs sampling, hyperpriors, noninformative, weakly informative, informative, 3PNO models, 2PNO models

∗ Department of Educational Psychology & Special Education, Southern Illinois University Carbondale, Wham 223, Mailcode 4618, Carbondale, IL 62901. Mail Address: [email protected]

Albert (1992; see also Baker, 1998) was the first to apply Gibbs sampling to the unidimensional two-parameter normal ogive (2PNO; Lord & Novick, 1968) IRT model, using the data augmentation idea of Tanner and Wong (1987). Sahu (2002; see also Béguin & Glas, 2001; Johnson & Albert, 1999) further generalized the approach to the three-parameter normal ogive (3PNO; Lord, 1980) model. In their model specifications, item hyperparameters take specific values. Such a procedure, however, has been found to rely on the availability of appropriate prior information for the 3PNO model or for small datasets, as is illustrated in the following subsection.

1.1 Prior specifications

The fully Bayesian estimation with IRT models offers the flexibility of setting prior distributions for model parameters or hyperparameters. The existing literature in Bayesian statistics (see, e.g., Gelman et al., 2003) indicates that prior distributions take one of two general forms: they are either fully informative, using application-specific information, or noninformative. Each of these is adopted depending on the availability of appropriate prior information, which can be obtained through expert opinion or past studies. If information is not available or appropriate, vague, noninformative, or diffuse priors are usually chosen because of this uncertainty (e.g., Rupp, Dey, & Zumbo, 2004). On the other hand, when appropriate prior information is available, informative prior densities with small variances are considered. They are especially useful for small datasets, as model parameters can be constrained from assuming unreasonable values and hence estimated more accurately when the data are not themselves very informative (Lim & Drasgow, 1990; Mislevy, 1986; Swaminathan & Gifford, 1983, 1985, 1986). The effect of informative priors for small datasets is particularly pronounced with Albert's (1992) Gibbs sampling procedure for 2PNO models (see Sheng, 2010). Furthermore, the Gibbs sampling procedure for 3PNO models developed by Sahu (2002) suffers from a nonconvergence problem when the prior densities for the item slope and intercept parameters are not strongly informative.
In particular, improper noninformative priors for these item parameters result in an undefined posterior distribution, which gives rise to unstable parameter estimates (e.g., Sheng, 2008). Even with proper noninformative prior densities, the procedure either fails to converge or requires an enormous number of iterations for the Markov chain to reach convergence (Sheng, 2010). On the other hand, priors for the pseudo-chance-level parameter in the 3PNO model can be chosen in a typical fashion, as its posterior estimates are not sensitive to informative or noninformative prior specifications (Sheng, 2008, 2010). Hence, parameters of the 2PNO model for small datasets, or those of the 3PNO model, can only be accurately estimated when prior information regarding item intercepts and slopes is available and adequate, as a misspecified informative prior would result in biased estimates of these parameters (Sheng, 2010), potentially leading to inaccurate estimates of person latent traits. This limits the use of Gibbs sampling for unidimensional IRT models.

1.2 Hyperpriors

In many situations, Bayesian hierarchical models are developed in which an additional prior distribution is specified for a hyperparameter of a prior distribution. These second-order priors are called hyperpriors, and are useful for incorporating uncertainty in the hyperparameters (e.g., scale or location parameters) of a prior distribution (Carlin & Louis, 2000). Estimating hyperparameters from the hyperpriors and the data reduces the difficulty of specifying parameter values that lead to adequate priors for item parameters, and is therefore a "method to elicit the optimal prior distributions" (Hall, 2012, p.10). This makes hierarchical modeling appealing for both theoretical and practical reasons (see Gelman, 2006, p.515 for a detailed description). In the IRT literature, much attention has focused on the development of Bayesian hierarchical models (e.g., Bradlow, Wainer, & Wang, 1999; Fox, 2010; Swaminathan & Gifford, 1982, 1985, 1986). However, to date little work has investigated the effect of hyperpriors on model parameters (see, e.g., Kim, Cohen, Baker, Subkoviak, & Leonard, 1994). Given that the advantages of hierarchical over nonhierarchical modeling have been empirically demonstrated in other areas, such as Bayesian treed models (Chipman, George, & McCulloch, 2002), fossil calibration (Heath, 2012), multiple-recapture population estimation (Clark, Ferraz, Oguge, Hays, & DiCostanzo, 2005), and cardiovascular therapies (Kwok & Lewis, 2011), Bayesian estimation with hierarchical priors specified for the item hyperparameters should provide a better approach to modeling unidimensional item response data when appropriate prior information is not readily available and when datasets are not sufficiently large.
1.3 Purpose of the study

In view of the above, this study focuses on specifying priors in a hierarchical fashion, so that the item slope and intercept hyperparameters in unidimensional IRT models are unknown and have their own prior distributions (or hyperpriors). The performance of such hierarchical modeling is investigated by comparing its parameter recovery with that of the conventional procedure, where the hyperparameters take specific values. This approach is expected to resolve the nonconvergence problem introduced by setting noninformative or weakly informative priors in the conventional procedure when prior information is not available or appropriate. Moreover, this modeling scheme allows a more objective approach to inference by estimating the hyperparameters from the data rather than specifying them using subjective information. This in turn allows for inference that better reflects our statistical understanding of the distribution of the item parameters. It has to be noted that when the full conditional distributions cannot be obtained in closed form, more complicated MCMC procedures have to be adopted. For example, Patz and Junker (1999a, 1999b) adopted Metropolis-Hastings within Gibbs (Chib & Greenberg, 1995) for the two-parameter logistic (2PL) and three-parameter logistic (3PL) models. As Gibbs sampling is relatively easier to implement, and the logistic and normal ogive models are practically indistinguishable in model fit and parameter estimates (Birnbaum, 1968; Embretson & Reise, 2000), the MCMC procedures for logistic models are not considered in this study.

1.4 Structure of this paper

The remainder of the paper is organized as follows. The IRT model with hierarchical priors is briefly outlined in Section 2, with a description of the Gibbs sampling procedure and prior specifications for the model parameters. Section 3 presents a simulation study on the performance of the hierarchical three-parameter model, comparing it with the model where prior densities are not specified in a hierarchical fashion. The simulations distinguished situations where prior information was available and appropriate from those where the available prior information was not appropriate. For the purpose of comparison, a similar simulation study is presented in Section 4 pertaining to the performance of the simpler two-parameter model. Finally, a few summary remarks are provided in Section 5.

2. Model and the Gibbs sampling procedure

The unidimensional IRT model provides a fundamental framework for modeling the person-item interaction by assuming one latent dimension. Suppose a test consists of k binary response items, each measuring a single unified trait, θ. Let y = [yij]n×k denote a matrix of n persons' responses to these k items, where yij = 1 (yij = 0) if the i-th person answers the j-th item correctly (incorrectly), for i = 1, . . . , n and j = 1, . . . , k. The probability of yij = 1 is defined for the 3PNO model as

    P(yij = 1 | θi, αj, βj, γj) = γj + (1 − γj)Φ(αjθi − βj)
                                = γj + (1 − γj) ∫_{−∞}^{αjθi−βj} (2π)^{−1/2} e^{−t²/2} dt,   0 ≤ γj < 1,   (1)

where θi is the latent trait parameter, αj is a slope parameter describing item discrimination, βj is an intercept parameter associated with the item difficulty βj* such that βj = αj βj*, and γj is a pseudo-chance-level parameter, indicating that the probability of a correct response is greater than zero even for persons with very low trait levels. This model is applicable to objective items such as multiple-choice or true-or-false items, where an item may be too difficult for some persons. When γj = 0, the model simplifies to the 2PNO model, defined as

    P(yij = 1 | θi, αj, βj) = Φ(αjθi − βj).   (2)
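To make the model concrete, the response probability in Equations (1) and (2) can be sketched in Python; this is an illustrative helper rather than code from the paper, and the function names are our own. The normal ogive Φ is computed from the error function.

```python
from math import erf, sqrt

def normal_cdf(x):
    # Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_3pno(theta, alpha, beta, gamma=0.0):
    # Probability of a correct response under the 3PNO model, Eq. (1).
    # With gamma = 0 this reduces to the 2PNO model, Eq. (2).
    return gamma + (1.0 - gamma) * normal_cdf(alpha * theta - beta)
```

Note how the pseudo-chance-level parameter acts as a floor: with γj = .2, even a person with a very low trait level answers the item correctly with probability close to .2.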

Hence, the three-parameter model is more general and is applicable to a variety of testing situations where the two-parameter model may be inadequate. Given this, Bayesian hierarchical modeling is illustrated for the 3PNO model only, as follows.

To implement Gibbs sampling for the 3PNO model defined in (1), two latent variables, Z and W, are introduced such that Zij ∼ N(ηij, 1) (Albert, 1992; Tanner & Wong, 1987), where ηij = αjθi − βj, and

    Wij = 1 if person i knows the correct answer to item j, and Wij = 0 if person i does not,

with probability density function

    P(Wij = wij | ηij) = Φ(ηij)^wij (1 − Φ(ηij))^(1−wij)   (3)

(Béguin & Glas, 2001; Sahu, 2002). Prior distributions can be assumed for θi, ξj and γj, where ξj = (αj, βj)′. This paper focuses on conjugate priors. In the literature, a standard normal prior is commonly adopted for the person parameters θi (Albert, 1992; Baker, 1998; Béguin & Glas, 2001; Glas & Meijer, 2003; Johnson & Albert, 1999) to ensure a unique scaling and hence to resolve a particular identification problem in the model (see, e.g., Albert, 1992). With prior information, a beta prior distribution is usually assumed for γj, a normal prior for βj (Béguin & Glas, 2001; Glas & Meijer, 2003; Johnson & Albert, 1999; Sahu, 2002), and a truncated normal prior for αj (Sahu, 2002), so that γj ∼ Beta(s, t), βj ∼ N(μβ, σβ²), and αj ∼ N(0,∞)(μα, σα²). Note that N(0,∞)(μα, σα²) is a left-truncated normal density with αj values restricted to lie between 0 and ∞ (see Sahu, 2002; Sheng, 2010). Values of the hyperparameters in the prior distributions can be specified differently given prior information; smaller values of σα² and σβ², as well as larger values of s and t, correspond to more precise prior information.

This paper considers hierarchical modeling with hyperpriors assumed for the item slope and intercept hyperparameters. Hence, with prior distributions specified for μα, σα², μβ and σβ², the joint posterior distribution of (θ, ξ, γ, W, Z, μξ, Σξ) is

    p(θ, ξ, γ, W, Z, μξ, Σξ | y) ∝ f(y|W, γ) p(W|Z) p(Z|θ, ξ) p(θ) p(γ) p(ξ|μξ, Σξ) p(μξ) p(Σξ),   (4)

where μξ = (μα, μβ)′, Σξ = diag(σα², σβ²), and f(y|W, γ) = ∏_{i=1}^{n} ∏_{j=1}^{k} pij^yij (1 − pij)^(1−yij) is the likelihood function, with pij being the model probability as defined in (1). Prior distributions for the hyperparameters μα, σα², μβ and σβ² can be specified in the usual manner. A common noninformative choice would be a prior that is uniform on log(σα) or log(σβ). However, this results in an improper posterior distribution, and hence a uniform prior density on σα or σβ is considered instead (see Gelman, 2006 for an illustration). Alternatively, one can impose a conjugate normal prior density on μα or μβ, and an inverse gamma (IG) prior on σα² or σβ². The full conditional distribution of each parameter or hyperparameter can subsequently be derived in closed form and updated iteratively using Gibbs sampling (see the Appendix for the derived full conditional distributions).
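As an illustration of the hierarchical step, the full conditional draws for (μα, σα²) under conjugate normal and inverse gamma hyperpriors can be sketched as follows. This is a hedged sketch rather than the paper's code: the truncation of the slope prior at zero is ignored for simplicity, and the function name and default hyperprior values (chosen to match Spec 4 of Section 3) are our own.

```python
import numpy as np

def update_slope_hyperparameters(alpha, sigma2, m0=0.0, v0=10000.0, a=2.0, b=1.0,
                                 rng=np.random.default_rng(0)):
    """One Gibbs update of (mu_alpha, sigma_alpha^2) given the k current item
    slopes, under mu ~ N(m0, v0) and sigma^2 ~ IG(a, b) hyperpriors."""
    alpha = np.asarray(alpha, dtype=float)
    k = alpha.size
    # mu | alpha, sigma^2: conjugate normal update.
    prec = 1.0 / v0 + k / sigma2
    mean = (m0 / v0 + alpha.sum() / sigma2) / prec
    mu = rng.normal(mean, np.sqrt(1.0 / prec))
    # sigma^2 | alpha, mu: conjugate inverse-gamma update
    # (draw a Gamma(shape, rate=scale) variable and invert it).
    shape = a + k / 2.0
    scale = b + 0.5 * np.sum((alpha - mu) ** 2)
    sigma2 = 1.0 / rng.gamma(shape, 1.0 / scale)
    return mu, sigma2
```

These draws replace the fixed values of μα and σα² used in the nonhierarchical procedure; an analogous update applies to (μβ, σβ²).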

3. Simulation 1

3.1 Method

To illustrate the performance of such hierarchical modeling with 3PNO models, a simulation study was conducted, comparing it with the nonhierarchical model where the item hyperparameters take specific values. In the simulation, prior information regarding item slopes and intercepts was assumed to be (a) available and appropriate, or (b) available but not appropriate. Under either condition, three factors were manipulated, namely, sample size, test length, and the specification of the item slope and intercept hyperparameters. In particular, item responses for k items (k = 10, 20, 100) and n individuals (n = 100, 300, 1000, 5000) were generated according to the 3PNO model, as defined in (1). Person trait parameters were generated as samples from a standard normal distribution. Item pseudo-chance-level parameters were generated from γ ∼ U(.05, .25). Item slope and intercept parameters were generated from (a) α ∼ U(0, 2) and β ∼ U(−2, 2) for situations where prior information regarding α and β was available and appropriate, or (b) α ∼ U(2, 3.2) and β ∼ U(2, 4) for situations where prior information was available but not appropriate. When implementing the MCMC procedure, a diffuse Beta(1, 1) prior was assumed for γj. In addition, five ways of setting μα, σα², μβ and σβ² were considered, such that the hyperparameters:

Spec 1: took specific values μα = μβ = 0, σα² = σβ² = 100;
Spec 2: took specific values with much smaller variances μα = μβ = 0, σα² = σβ² = 1;
Spec 3: had noninformative uniform hyperpriors p(μα, σα²) ∝ 1/σα and p(μβ, σβ²) ∝ 1/σβ;
Spec 4: had proper conjugate hyperpriors with a N(0, 10000) for μα or μβ and an IG(2, 1) for σα² or σβ²;
Spec 5: had relatively more informative conjugate hyperpriors with a N(0, 100) for μα or μβ and an IG(3, 2) for σα² or σβ².
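The data-generating scheme described above can be sketched as follows; this is our own illustrative code (with condition (a) hard-coded), using the parameter ranges stated in the text.

```python
import numpy as np
from math import erf, sqrt

def simulate_3pno_data(n, k, seed=1):
    """Generate an n x k binary response matrix under the 3PNO model with
    theta ~ N(0,1), alpha ~ U(0,2), beta ~ U(-2,2), gamma ~ U(.05,.25)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n)
    alpha = rng.uniform(0.0, 2.0, k)
    beta = rng.uniform(-2.0, 2.0, k)
    gamma = rng.uniform(0.05, 0.25, k)
    eta = np.outer(theta, alpha) - beta            # eta_ij = alpha_j * theta_i - beta_j
    phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))
    p = gamma + (1.0 - gamma) * phi(eta)           # Eq. (1)
    y = (rng.uniform(size=(n, k)) < p).astype(int)
    return y, theta, alpha, beta, gamma
```

Condition (b) would simply swap in α ∼ U(2, 3.2) and β ∼ U(2, 4).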
It is noted that the first two specifications, where the hyperparameters took specific values, pertain to situations where priors were not set in a hierarchical fashion, whereas the last three pertain to the hierarchical 3PNO model illustrated in Section 2. In nonhierarchical models, the spread of the prior distribution plays a critical role in parameter estimation, as smaller values of σα² and σβ² constrain estimates to values closer to the prior mean and are hence considered more informative. Moreover, previous work with such models indicates that conjugate priors with a very large scale hyperparameter, e.g., σα² = σβ² = 10¹⁰, lead to nonconvergent Markov chains (Sheng, 2010). Hence, the first two specifications were determined such that the priors were either highly informative, with σα² = σβ² = 1, or not that highly informative, with σα² = σβ² = 100. With respect to hierarchical models, relatively less informative priors are generally considered for model hyperparameters to reflect uncertainty about them (e.g., Williams & Locke, 2003). In the Bayesian literature, prior distributions for the scale hyperparameter in hierarchical linear models usually take the form of an improper uniform prior on σ (Gelman, 2006) or a conjugate inverse gamma prior (e.g., Spiegelhalter et al., 1996). An IG(2, 1) distribution has a mean of 1 with an infinite variance, and an IG(3, 2) distribution has a mean of 1 and a variance of 1. Consequently, the last three specifications with hierarchical priors were chosen to reflect prior knowledge about the item hyperparameters that is improper noninformative, proper weakly informative, and proper informative, respectively. For ease of illustration, the five prior specifications will be denoted Spec 1 through Spec 5 in the subsequent discussions.

One has to note that Spec 2, where the first-order priors were highly informative, was appropriately specified for the situation where prior information regarding α and β was assumed to be available and appropriate, as the generated values were within 2 standard deviations of their respective prior means. Hence, the small values of σα² and σβ² in this specification were expected to help stabilize the posterior estimates. Indeed, as Sheng (2010) indicates, Bayesian nonhierarchical modeling for 3PNO models performs well in situations where informative priors are assumed for the item intercept and slope parameters. One has to be careful in specifying such small prior variances, however, because in practical applications the available prior information may not always be appropriate, as with the use of Spec 2 for data that were generated using α ∼ U(2, 3.2) and β ∼ U(2, 4). In that case the generated values differed from the majority of the values under their prior distributions, being about 2 to 4 standard deviations above their respective prior means.
This usually happens when, in practice, the present data differ considerably from the past data. If prior information is obtained from past data, a small prior variance (such as those in Spec 2) may constrain posterior estimates to take values close to the misspecified prior mean, which in turn results in biased estimates if there is not sufficient signal in the data to inform inference. Hierarchical modeling of the 3PNO model, on the other hand, does not rely on the specification of informative priors, as the hyperparameters do not have to be specified, and was hence expected to offer advantages in such situations.

The Gibbs sampling procedure was applied to the simulated data, with 10,000 iterations obtained and the first 5,000 discarded as burn-in. Convergence was evaluated using the Gelman-Rubin R statistic (Gelman & Rubin, 1992). The usual practice is to use multiple Markov chains from different starting points. Alternatively, a single chain can be divided into sub-chains, so that convergence is assessed by comparing the between- and within-sub-chain variances (Fox, 2007). Since a single chain is less wasteful in the number of iterations needed, the latter approach was adopted. For each Markov chain, the initial values were set to αj = 1, βj = 0, and γj = .2 for all items j, and θi = 0 for all persons i. After discarding the burn-in samples, the chain was separated into five sub-chains of equal length and the R statistic was calculated following the procedure of Gelman and Rubin (1992). Convergence can also be monitored visually using time series graphs of the simulated sequence, such as the trace plot and the running mean plot shown in the next section. Inspection of such plots, however, has been criticized as unreliable and unwieldy in the presence of a large number of model parameters (Gelman et al., 2003; Nylander, Wilgenbusch, Warren, & Swofford, 2008). The R statistic obtained from a single chain was hence the major approach for assessing convergence in this study.

For each simulated scenario, 100 replications were conducted to avoid erroneous results in estimation due to sampling error. The accuracy of parameter estimates was evaluated using the root mean square error (RMSE) and bias. Let τ denote the true value of a parameter (αj, βj, or γj) and tr its estimate in the r-th replication (r = 1, . . . , R). The RMSE is defined as

    RMSEτ = √( Σ_{r=1}^{R} (tr − τ)² / R ),   (5)

and the bias is defined as

    BIASτ = Σ_{r=1}^{R} (tr − τ) / R.   (6)

These quantities were averaged over items to provide summary indices.
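Equations (5) and (6) amount to the following computation over replications (an illustrative helper; the name is our own):

```python
import numpy as np

def rmse_and_bias(estimates, true_value):
    """RMSE and bias of R replicated estimates of one parameter, Eqs. (5)-(6)."""
    t = np.asarray(estimates, dtype=float)
    diff = t - true_value
    return float(np.sqrt(np.mean(diff ** 2))), float(np.mean(diff))
```

For example, two replicated estimates 1.1 and 0.9 of a true value of 1.0 give an RMSE of 0.1 and a bias of 0, showing how symmetric errors cancel in the bias but not in the RMSE.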

3.2 Results and discussion

For either set of the simulations, where the available prior information was assumed to be appropriate or not appropriate, convergence was evaluated for the posterior samples of αj, βj and γj using Gelman-Rubin R statistics as well as trace plots and running mean plots, such as those shown in Figure 1 for the intercept parameter of one item. It is observed that except for Spec 1, where σα² = σβ² = 100, all the chains converged to their stationary distributions within 10,000 iterations under the simulated conditions. Hence, when the 3PNO model is not specified in a hierarchical fashion, the prior densities of αj and βj have to be strongly informative (such as in Spec 2) to stabilize posterior estimates so that the Markov chain can converge within a certain chain length. This agrees with findings from Sheng (2010). On the other hand, with hierarchical modeling, convergence can be achieved even with noninformative or weakly informative priors, such as Specs 3 and 4. Convergence for Spec 1 may be improved by increasing the chain length; since the focus of the study was on the performance of the hierarchical models, additional iterations were not adopted. Given that the Markov chains did not converge under this specification, its results are not reported or further discussed.

Figure 1: Trace plots, running mean plots, and Gelman-Rubin R statistics of α for one item with each of the five prior specifications for the hyperparameters in the 3PNO model (k = 20, n = 1000).

With converged Markov chains, the posterior estimates for the model with the last four prior specifications were obtained as the posterior expectations of the Gibbs samples, and the average RMSE and bias values for αj and βj are graphically presented in Figures 2 and 3 for conditions where prior information was assumed to be available and appropriate. A close examination of these plots indicates:
• In all the simulated scenarios, Spec 2 consistently results in a smaller estimation error with less bias. Increased sample sizes bring Specs 3, 4 and 5 closer to Spec 2. Specifically, when sample sizes are large (i.e., n ≥ 1000), Spec 3 has fairly similar RMSE and bias values to Spec 2.
• Increased sample sizes (n) and/or test lengths (k) tend to improve the accuracy in estimating αj or βj, with smaller bias. However, the effect of test length on βj is somewhat different: increased k does not necessarily improve estimation accuracy or reduce bias. For example, the average bias value for Spec 2 increases slightly as k increases from 10 to 100. The effect of test length shows more in the differences in error or bias in estimating βj across Specs 2 through 5. In particular, as k increases, the differences among Specs 3, 4 and 5 become smaller, whereas those between Spec 2 and the other three specifications become larger (see Figure 3).
• Among the three specifications with hierarchical priors, Spec 3 performs worse than Spec 4 or 5 when sample sizes or test lengths are small. With increased n (e.g., n ≥ 1000), Spec 3 starts to involve less estimation error and bias than Spec 4 or 5.
Hence, when prior information is available and appropriate, the conventional MCMC procedure should be used, with informative prior densities specified for the item slope and intercept parameters in 3PNO models. When sample sizes are large enough (e.g., n > 1000), hierarchical modeling may be adopted to achieve a similar level of accuracy in estimating model parameters.

Figure 2: RMSE and bias for α in the 3PNO model when prior information is assumed to be available and appropriate.

Figure 3: RMSE and bias for β in the 3PNO model when prior information is assumed to be available and appropriate.

With respect to situations where prior information for αj or βj was assumed to be available but not appropriate, Figures 4 and 5 summarize the parameter recovery results. From the plots, it is apparent that Specs 3, 4 and 5 consistently outperform Spec 2 in all the simulated scenarios. Among the three specifications with hyperpriors, those with less informative priors (Specs 3 and 4) result in smaller error or bias than the one with relatively more informative priors (Spec 5) for small datasets (k ≤ 20, n ≤ 300). When the test length or sample size increases, their differences become negligible. Moreover, sample sizes and test lengths again have an effect on the posterior estimates. In particular, an increased sample size results in smaller average RMSE and bias, bringing Spec 2 closer to the other three specifications. On the other hand, an increased test length tends to improve accuracy and reduce bias for Specs 3, 4 and 5, but not for Spec 2, making their differences larger. In effect, it even results in a larger estimation error and bias in estimating βj with Spec 2 (see Figure 5). One possible explanation is that the misspecification associated with Spec 2 applies to each item parameter, so when there are more items, more misspecifications are involved and the effect is larger. These results suggest that misspecified item hyperparameters, such as those in Spec 2, tend to introduce additional error and bias in parameter estimation, which is consistent with findings from Sheng (2010). When prior information is incorrect or not available, estimating the hyperparameters from the data rather than specifying them using subjective information is suggested for the 3PNO model. Hierarchical modeling offers such flexibility by making it possible to specify non- or weakly informative prior densities for the item hyperparameters. With hyperparameters estimated from the data and their hyperpriors, the procedure allows a more objective approach to inference in this context. Additionally, when it comes to item hyperpriors, it is suggested that relatively less informative, or flatter, distributions be considered.
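The single-chain convergence check used above, splitting the post-burn-in chain into five sub-chains and comparing between- and within-sub-chain variances, can be sketched as follows. This is our own simplified rendering of the Gelman-Rubin statistic applied to one chain, not the paper's implementation.

```python
import numpy as np

def split_chain_rhat(chain, m=5):
    """Potential scale reduction factor R computed by splitting one chain
    of posterior draws into m sub-chains of equal length."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size // m
    sub = chain[: n * m].reshape(m, n)
    W = sub.var(axis=1, ddof=1).mean()             # within-sub-chain variance
    B = n * sub.mean(axis=1).var(ddof=1)           # between-sub-chain variance
    var_plus = (n - 1) / n * W + B / n             # pooled variance estimate
    return float(np.sqrt(var_plus / W))
```

Values close to 1 indicate that the sub-chains agree, i.e., the chain has reached stationarity; a chain that is still drifting produces sub-chain means that disagree and hence an inflated R.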

4. Simulation 2

4.1 Method

For the purpose of comparison, a similar simulation study was carried out with the Gibbs sampling procedure for the hierarchical 2PNO model, where the five specifications of the hyperparameters for αj and βj (as listed in the previous section) were considered. Item responses for k items (k = 10, 20, 100) and n individuals (n = 100, 300, 1000, 5000) were generated using the 2PNO model as in (2), in which person trait parameters were generated as samples from a standard normal distribution, and item parameters were generated as samples from uniform distributions so that (a) αj ∼ U(0, 2) and βj ∼ U(−2, 2), assuming their prior information was available and appropriate, or (b) αj ∼ U(2, 3.2) and βj ∼ U(2, 4), assuming the available prior information was not appropriate.

Figure 4: RMSE and bias for α in the 3PNO model when prior information is assumed to be available but not appropriate.

Gibbs sampling was implemented to fit the 2PNO model to the simulated data, with initial values αj = 1 and βj = 0 for all items and θi = 0 for all persons. 10,000 iterations were obtained, with the first half set as burn-in, and the posterior expectations of the remaining Gibbs samples were taken as the posterior estimates. With 100 replications, the accuracy of parameter estimates was evaluated using the RMSE and bias, as defined in (5) and (6). Their values were averaged over items to provide summary indices.

4.2 Results

Convergence was evaluated for the posterior samples of αj and βj using Gelman-Rubin R statistics as well as time series plots, which indicate that when the sample size was not large (n ≤ 300), the Markov chains did not reach their stationary distributions for Spec 1 within 10,000 iterations; those results are hence not reported. It is noted that the degree of nonconvergence was not as serious as that of the 3PNO model under the same conditions, and that this problem should be resolved by increasing the chain length. On the other hand, convergence was observed for the other prior specifications and/or under the other sample size conditions. Hence, as with the 3PNO model, one of the advantages hierarchical modeling offers for 2PNO models is that, with non- or weakly informative priors, the Markov chain converges more quickly when the sample size is not sufficiently large.

Figure 5: RMSE and bias for β in the 3PNO model when prior information is assumed to be available but not appropriate.

For the Markov chains that reached stationarity, the posterior estimates were obtained as the posterior expectations of the Gibbs samples, and the average RMSE and bias values for αj and βj are plotted in Figures 6 and 7 for situations where prior information was assumed to be available and appropriate. These plots suggest:
• When the sample size is small, e.g., n < 300, Spec 2 has a slight advantage over the specifications with hyperpriors (i.e., Specs 3, 4 and 5) in accuracy, if not bias, in estimating αj or βj. Their differences, and particularly those between Spec 2 and Spec 4 or 5, are fairly small. When the sample size gets larger (n > 300), Specs 3, 4 and 5 are essentially identical to Spec 2 in estimation accuracy.
• Increased n and/or k tend to improve the accuracy, as well as to reduce the differences among the five specifications in the accuracy of estimating αj or βj.

Figure 6: RMSE and bias for α in the 2PNO model when prior information is assumed to be available and appropriate.

• It is worth noting that Spec 1 requires n > 300 to converge and needs a very large sample size (n = 5000) to perform similarly to the other specifications in recovering α_j or β_j. Comparing the first two prior specifications, where the model is not hierarchical, Spec 2 is relatively better, with smaller average RMSE and bias.

• Among the three specifications involving hyperpriors, Spec 3 performs worse than Spec 4 or 5 when the sample size or test length is small. With larger datasets (e.g., n ≥ 300 and/or k ≥ 20), the three specifications yield estimates with identical error and bias.

Hence, when prior information is available and appropriate, one may use it to specify values for the item hyperparameters and obtain accurate parameter estimates with little bias. Alternatively, one can assume hyperpriors for the item intercept and slope hyperparameters to obtain similar results for the 2PNO model, particularly when the sample size is not very small (n ≥ 300). This, however, differs from what was observed with the 3PNO model. It is further noted that for nonhierarchical models, a more informative prior is preferred when sample sizes are not large enough (n < 5000).

Figure 7: RMSE and bias for β in the 2PNO model when prior information is assumed to be available and appropriate.

For the situation where the available prior information was assumed to be not appropriate, the parameter recovery results are displayed graphically in Figures 8 and 9. From them, it is apparent that Spec 3 tends to outperform the other specifications in terms of both accuracy and bias, whereas Spec 2 always involves larger error and bias in estimating α_j or β_j. In addition, when sample sizes are large enough, e.g., n > 300, the three specifications involving hyperpriors (Specs 3, 4 and 5) are fairly close in recovering α_j or β_j. When sample sizes are very large (n = 5000), Spec 1 results in RMSE and bias values similar to those of Specs 3, 4 and 5. As observed previously, an increased sample size results in smaller error and bias, and the five specifications perform more similarly in recovering item parameters as the sample size increases. These patterns are, however, not observed for increased test length, which may reduce error or bias for Specs 3 to 5, but not for Spec 2. In effect, as k increases, the differences between Spec 2 (or Spec 1) and the other three specifications become larger.

Figure 8: RMSE and bias for α in the 2PNO model when prior information is assumed to be available but not appropriate.

Consequently, hierarchical modeling, and particularly the specification with noninformative uniform prior densities for the item slope and intercept hyperparameters, is recommended for estimating 2PNO model parameters when the available prior information is not appropriate. Specifying less informative priors for the item parameters in the nonhierarchical model is not recommended unless sample sizes are very large, e.g., n ≥ 5000. It is also noted that when the sample size (not the test length) gets large, the bias introduced by the misspecification in Spec 2 is not that serious in estimating model parameters. This further suggests that with enough subjects (e.g., n ≥ 5000) for tests that are not long (k ≤ 20), namely, when there is sufficient signal in the data to inform inference about item parameters, the conventional MCMC procedure (i.e., the nonhierarchical model) with informative priors may be considered even if appropriate prior information is not available.
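The accuracy and bias measures used throughout these comparisons can be computed as sketched below. This is a minimal illustration, not the study's actual code: the replication count, item count, and the noise model for the estimates are all hypothetical, and, as in the figures, RMSE and bias are averaged over items.

```python
import numpy as np

def recovery_stats(estimates, truth):
    """Average RMSE and bias of item-parameter estimates.

    estimates: (R, k) array of posterior means from R replications.
    truth:     (k,)  array of generating (true) parameter values.
    Returns (RMSE, bias), each averaged over the k items.
    """
    err = estimates - truth                     # (R, k) estimation errors
    rmse_j = np.sqrt((err ** 2).mean(axis=0))   # per-item RMSE over replications
    bias_j = err.mean(axis=0)                   # per-item bias over replications
    return rmse_j.mean(), bias_j.mean()

# Hypothetical illustration: 25 replications of a 10-item test, with
# estimates perturbed by a small systematic shift plus random error.
rng = np.random.default_rng(0)
true_alpha = rng.uniform(0.5, 2.0, size=10)
est_alpha = true_alpha + rng.normal(0.05, 0.2, size=(25, 10))
rmse, bias = recovery_stats(est_alpha, true_alpha)
```

Perfect recovery yields zero RMSE and bias; a systematic shift in the estimates, such as the +0.05 above, shows up as nonzero average bias.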

Figure 9: RMSE and bias for β in the 2PNO model when prior information is assumed to be available but not appropriate.

5. Discussion

In summary, the results of the simulation studies provide empirical evidence for the advantage of Bayesian hierarchical modeling over conventional nonhierarchical modeling, where hyperparameters take specific values, under the simulated test situations. In particular, when prior information is not available, or when it is available but not appropriate, vague priors with a large variability have to be adopted for the item intercept and slope parameters to avoid biased estimates. The conventional approach, however, runs into difficulty for two-parameter models with an insufficient sample size (n ≤ 300) and for three-parameter models, because the Markov chains either fail to converge or need a large number of iterations (Sheng, 2008, 2010). Specifying these models in a hierarchical fashion resolves this problem; the solution involves selecting noninformative or weakly informative hyperpriors (such as Specs 3 and 4 in this study) for the item hyperparameters. This approach allows one to estimate each hyperparameter over a wide interval rather than fixing it at a specific value, and the vague hyperprior does not seem to affect posterior precision the way a vague prior (such as Spec 1) does. This can be explained by the fact that the uncertainty is at the second, or hyperprior, level instead of at the first level of the prior, so its effect on the resulting posterior distribution is so small that the posterior is determined mainly by the data (Williams & Locke, 2003).

When prior information is available and appropriate, specifying strongly informative priors in a nonhierarchical fashion is useful in that it not only helps reduce bias and increase precision in estimating parameters but also helps convergence of the Markov chain for the 2PNO model with small datasets and for the 3PNO model (Sheng, 2010). The same results have been observed with hierarchical modeling for 2PNO models when the sample size is not small (n > 100) and for 3PNO models with a reasonably large sample size (n ≥ 5000). Consequently, when the present data are similar to past data and the sample size meets the above criterion, one can either use the information about the past data to specify priors for the item parameters in a nonhierarchical fashion, or use hierarchical modeling with relatively vague hyperpriors to estimate the item hyperparameters. Hence, in practice, if one is not certain about the appropriateness of the priors adopted for the item parameters, it is preferable to test the sensitivity of MCMC estimates to informative priors in a nonhierarchical model and to vague hyperpriors in a hierarchical model, rather than to choose an arbitrary informative prior that may markedly bias the results.

When it comes to specifying hyperprior distributions, weakly informative priors (such as Spec 4 in this study) are sometimes preferred to noninformative priors when the sample size is not reasonably large. A weakly informative prior is proper and provides "stable, regularized estimates while still being vague enough" (Gelman, Jakulin, Pittau, & Su, 2008, p.1361). However, selecting a weakly informative prior can be tricky. In practice, one can start with a simple, relatively noninformative prior for the item hyperparameters and add information if much variation remains in the posterior distribution.

Although this study focuses on unidimensional models, its results shed light on the use of hierarchical modeling for similar yet more complicated Bayesian IRT models, such as those involving multiple latent traits (see, e.g., Béguin & Glas, 2001; Lee, 1995; Sheng & Headrick, 2012; Sheng & Wikle, 2007, 2008, 2009, for the developed Bayesian nonhierarchical models). However, given that these models involve different levels of complexity, the performance of hierarchical modeling may vary, which calls for additional research on the use of hyperpriors in modeling multidimensional item response data.
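The pooling effect behind these recommendations can be illustrated with a stylized normal-means sketch, not the 2PNO sampler itself: item effects β_j share an unknown mean μ with a flat hyperprior, and a two-step Gibbs sampler alternates between the conditional posteriors of β and μ. All constants (k, n, τ) are hypothetical and chosen to mimic a small-sample setting.

```python
import numpy as np

rng = np.random.default_rng(42)
k, n, tau = 30, 3, 0.3                # items, persons per item, between-item SD
beta = rng.normal(0.0, tau, k)         # true item effects
y = rng.normal(beta[:, None], 1.0, (k, n))
ybar = y.mean(axis=1)                  # per-item sample means

# Gibbs sampler: beta_j | mu ~ N((n*ybar_j + mu/tau^2)/prec, 1/prec) and
# mu | beta ~ N(mean(beta), tau^2/k), under a flat hyperprior on mu.
prec = n + 1.0 / tau ** 2
mu, draws = 0.0, []
for it in range(3000):
    b = rng.normal((n * ybar + mu / tau ** 2) / prec, np.sqrt(1.0 / prec))
    mu = rng.normal(b.mean(), tau / np.sqrt(k))
    if it >= 500:                      # discard burn-in
        draws.append(b)
post_mean = np.mean(draws, axis=0)

rmse_hier = np.sqrt(np.mean((post_mean - beta) ** 2))
rmse_flat = np.sqrt(np.mean((ybar - beta) ** 2))  # vague nonhierarchical prior
```

With only a few persons per item, the hierarchically shrunk estimates recover β more accurately than the unpooled means, mirroring the small-sample advantage of hyperpriors reported above; as n grows, the two approaches converge.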
It has to be noted that the conclusions based on the simulation studies in this paper are limited to the situations considered and may not generalize to other actual test situations. For example, in the simulations, the item intercept parameters were generated from uniform distributions with a range smaller than what usually occurs in practice; this was done to examine items with a medium difficulty level. Hence, the results of the study cannot be generalized to tests with difficult or easy items, and further studies are needed to investigate the performance of hierarchical modeling with unidimensional IRT models under conditions not specified in this paper. In addition, the two specifications with conjugate hyperpriors for the hierarchical models (i.e., Spec 4 and Spec 5) assumed an inverse-Gamma prior distribution for the scale hyperparameter σ²_α or σ²_β. Studies in Bayesian hierarchical modeling suggest that care has to be taken when specifying this type of hyperprior (see, e.g., Browne & Draper, 2006) and that the inverse-Gamma family may not be a good choice for scale parameters (Gelman, 2006). Further studies are needed to evaluate the performance of such hierarchical modeling for unidimensional IRT models with different prior densities specified for the item slope and intercept scale hyperparameters. Finally, this study considered only uniform and conjugate hyperpriors for α_j and β_j; other forms of prior densities may be considered in future investigations.
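Gelman's (2006) concern can be seen directly from the conjugate update: when the item slopes have little spread, the IG(ε₁, ε₂) hyperprior dominates the conditional posterior of σ²_α. The sketch below compares two common choices of (ε₁, ε₂); the slope values are hypothetical, and the update follows the form of (A.12) with μ_α fixed at the sample mean for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10
alpha = rng.normal(1.0, 0.05, k)            # slopes with very little spread
ss = np.sum((alpha - alpha.mean()) ** 2)    # sum of squared deviations

def sigma2_draws(eps1, eps2, ndraw=20000):
    """Draws from IG(k/2 + eps1, ss/2 + eps2), the conditional posterior of
    the slope-scale hyperparameter; an inverse-Gamma draw is the reciprocal
    of a Gamma(shape, scale = 1/rate) draw."""
    shape, rate = k / 2 + eps1, ss / 2 + eps2
    return 1.0 / rng.gamma(shape, 1.0 / rate, ndraw)

med_vague = np.median(sigma2_draws(0.001, 0.001))  # "vague" IG(0.001, 0.001)
med_unit = np.median(sigma2_draws(1.0, 1.0))       # IG(1, 1)
ratio = med_unit / med_vague   # far from 1 => inference is prior-sensitive
```

When ss is small, the two posterior medians differ by an order of magnitude, so the nominally vague inverse-Gamma hyperprior is anything but innocuous here, which is why alternative prior densities for the scale hyperparameters merit further study.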

Appendix

For the model parameters $\theta_i$, $\xi_j$ and $\gamma_j$ of the 3PNO model as defined in (1), conjugate priors can be specified such that
$$\theta_i \sim N(\mu, \sigma^2), \quad \alpha_j \sim N_{(0,\infty)}(\mu_\alpha, \sigma_\alpha^2), \quad \beta_j \sim N(\mu_\beta, \sigma_\beta^2), \quad \gamma_j \sim \mathrm{Beta}(s, t).$$
The full conditional distributions of $W_{ij}$, $Z_{ij}$, $\theta_i$, $\xi_j$ and $\gamma_j$, where $\eta_{ij} = \alpha_j\theta_i - \beta_j$, can subsequently be derived in closed form as follows:
$$W_{ij}\mid\bullet \sim \begin{cases} \mathrm{Bernoulli}\!\left(\dfrac{\Phi(\eta_{ij})}{\gamma_j + (1-\gamma_j)\Phi(\eta_{ij})}\right), & \text{if } y_{ij} = 1, \\[2mm] \mathrm{Bernoulli}(0), & \text{if } y_{ij} = 0, \end{cases} \tag{A.1}$$
$$Z_{ij}\mid\bullet \sim \begin{cases} N_{(0,\infty)}(\eta_{ij}, 1), & \text{if } W_{ij} = 1, \\ N_{(-\infty,0)}(\eta_{ij}, 1), & \text{if } W_{ij} = 0, \end{cases} \tag{A.2}$$
$$\theta_i\mid\bullet \sim N\!\left(\frac{\sum_j (Z_{ij} + \beta_j)\alpha_j + \mu/\sigma^2}{1/\sigma^2 + \sum_j \alpha_j^2},\; \frac{1}{1/\sigma^2 + \sum_j \alpha_j^2}\right), \tag{A.3}$$
$$\xi_j\mid\bullet \sim N\!\left((\mathbf{x}'\mathbf{x} + \Sigma_\xi^{-1})^{-1}(\mathbf{x}'Z_j + \Sigma_\xi^{-1}\mu_\xi),\; (\mathbf{x}'\mathbf{x} + \Sigma_\xi^{-1})^{-1}\right) I(\alpha_j > 0), \tag{A.4}$$
where $\mathbf{x} = [\boldsymbol{\theta}, -\mathbf{1}]$ and $I(\alpha_j > 0)$ is an indicator function that equals 1 if $\alpha_j > 0$ and 0 otherwise, and
$$\gamma_j\mid\bullet \sim \mathrm{Beta}(a_j + s,\; b_j - a_j + t), \tag{A.5}$$
where $a_j$ denotes the number of correct responses obtained by guessing for item $j$ and $b_j$ denotes the number of persons who do not know the correct answer to item $j$.

With respect to $\mu_\alpha$, $\sigma_\alpha^2$, $\mu_\beta$ and $\sigma_\beta^2$, noninformative priors can be specified so that $p(\mu_\alpha, \sigma_\alpha^2) \propto 1/\sigma_\alpha$ and $p(\mu_\beta, \sigma_\beta^2) \propto 1/\sigma_\beta$. One has to note that these are uniform on $\mu_\alpha$, $\sigma_\alpha$, $\mu_\beta$ and $\sigma_\beta$. The full conditional distributions for these hyperparameters are then derived as
$$\mu_\alpha\mid\bullet \sim N\!\left(\frac{\sum_j \alpha_j}{k},\; \frac{\sigma_\alpha^2}{k}\right), \tag{A.6}$$
$$\mu_\beta\mid\bullet \sim N\!\left(\frac{\sum_j \beta_j}{k},\; \frac{\sigma_\beta^2}{k}\right), \tag{A.7}$$
$$\sigma_\alpha^2\mid\bullet \sim IG\!\left(\frac{k-1}{2},\; \frac{\sum_j (\alpha_j - \mu_\alpha)^2}{2}\right), \tag{A.8}$$
$$\sigma_\beta^2\mid\bullet \sim IG\!\left(\frac{k-1}{2},\; \frac{\sum_j (\beta_j - \mu_\beta)^2}{2}\right). \tag{A.9}$$

Alternatively, conjugate prior distributions can be specified for $\mu_\alpha$, $\sigma_\alpha^2$, $\mu_\beta$ and $\sigma_\beta^2$ so that $\mu_\alpha \sim N(0, \sigma_{\bar\alpha}^2)$, $\mu_\beta \sim N(0, \sigma_{\bar\beta}^2)$, $\sigma_\alpha^2 \sim IG(\varepsilon_1, \varepsilon_2)$ and $\sigma_\beta^2 \sim IG(\zeta_1, \zeta_2)$. This paper considered only proper prior densities, so that $\varepsilon_1 \geq 2$ and $\zeta_1 \geq 2$. When prior information is not available, these can be specified to be relatively flat to ensure that the posterior estimates depend mostly on the data. The full conditional distributions for these hyperparameters can then be derived as
$$\mu_\alpha\mid\bullet \sim N\!\left(\left(\frac{k}{\sigma_\alpha^2} + \frac{1}{\sigma_{\bar\alpha}^2}\right)^{-1} \frac{\sum_j \alpha_j}{\sigma_\alpha^2},\; \left(\frac{k}{\sigma_\alpha^2} + \frac{1}{\sigma_{\bar\alpha}^2}\right)^{-1}\right), \tag{A.10}$$
$$\mu_\beta\mid\bullet \sim N\!\left(\left(\frac{k}{\sigma_\beta^2} + \frac{1}{\sigma_{\bar\beta}^2}\right)^{-1} \frac{\sum_j \beta_j}{\sigma_\beta^2},\; \left(\frac{k}{\sigma_\beta^2} + \frac{1}{\sigma_{\bar\beta}^2}\right)^{-1}\right), \tag{A.11}$$
$$\sigma_\alpha^2\mid\bullet \sim IG\!\left(\frac{k}{2} + \varepsilon_1,\; \frac{\sum_j (\alpha_j - \mu_\alpha)^2}{2} + \varepsilon_2\right), \tag{A.12}$$
$$\sigma_\beta^2\mid\bullet \sim IG\!\left(\frac{k}{2} + \zeta_1,\; \frac{\sum_j (\beta_j - \mu_\beta)^2}{2} + \zeta_2\right). \tag{A.13}$$

Hence, with starting values $\theta^{(0)}$, $\xi^{(0)}$, $\gamma^{(0)}$, $\mu_\xi^{(0)}$ and $\Sigma_\xi^{(0)}$, observations $W^{(l)}$, $Z^{(l)}$, $\theta^{(l)}$, $\xi^{(l)}$ and $\gamma^{(l)}$ can be obtained from the Gibbs sampling procedure by iteratively drawing from their respective full conditional distributions specified in (A.1) to (A.5). Similarly, samples for the hyperparameters $\mu_\xi^{(l)}$ and $\Sigma_\xi^{(l)}$ can be drawn from (A.6) to (A.9) assuming uniform priors, or from (A.10) to (A.13) assuming conjugate priors.
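As an illustration, the uniform-hyperprior updates (A.6) and (A.8) for the slope hyperparameters can be sketched as below; the intercept updates (A.7) and (A.9) are symmetric. The slope values and chain length are hypothetical, and the inverse-Gamma draw is taken as the reciprocal of a Gamma variate.

```python
import numpy as np

def update_slope_hyper(alpha, mu_a, rng):
    """One Gibbs scan over (sigma_a^2, mu_a) given the current slopes alpha.

    (A.8): sigma_a^2 | alpha, mu_a ~ IG((k-1)/2, sum((alpha - mu_a)^2)/2)
    (A.6): mu_a | alpha, sigma_a^2 ~ N(mean(alpha), sigma_a^2 / k)
    """
    k = len(alpha)
    rate = np.sum((alpha - mu_a) ** 2) / 2.0
    s2_a = 1.0 / rng.gamma((k - 1) / 2.0, 1.0 / rate)   # inverse-Gamma draw
    mu_a = rng.normal(alpha.mean(), np.sqrt(s2_a / k))
    return mu_a, s2_a

# Hypothetical slopes for a 30-item test; run a short hyperparameter chain.
rng = np.random.default_rng(7)
alpha = rng.normal(1.2, 0.4, 30)
mu_a, chain = alpha.mean(), []
for _ in range(5000):
    mu_a, s2_a = update_slope_hyper(alpha, mu_a, rng)
    chain.append(mu_a)
mu_hat = np.mean(chain)   # posterior mean of mu_alpha, centered near mean(alpha)
```

In the full sampler these updates would be interleaved with the draws from (A.1) to (A.5), so that the slopes themselves change between hyperparameter scans.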

REFERENCES

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251–269.
Baker, F. B. (1998). An investigation of the item parameter recovery characteristics of a Gibbs sampling approach. Applied Psychological Measurement, 17, 153–169.
Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66, 541–562.
Birnbaum, A. (1968). The logistic test model. In F. Lord & M. Novick (Eds.), Statistical theories of mental test scores (pp. 397–423). Reading, MA: Addison-Wesley.
Birnbaum, A. (1969). Statistical theory for logistic mental test models with a prior distribution of ability. Journal of Mathematical Psychology, 6, 258–276.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
Browne, W. J., & Draper, D. (2006). A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis, 1, 473–514.
Carlin, B. P., & Louis, T. A. (2000). Bayes and empirical Bayes methods for data analysis (2nd ed.). London: Chapman & Hall.
Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49, 327–335.
Chipman, H., George, E. I., & McCulloch, R. E. (2002). Bayesian treed models. Machine Learning, 48, 299–320.
Clark, J. S., Ferraz, G., Oguge, N., Hays, H., & DiCostanzo, J. (2005). Hierarchical Bayes for structured, variable populations: From recapture data to life-history prediction. Ecology, 86, 2232–2244.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. New Jersey: Lawrence Erlbaum Associates.
Fox, J.-P. (2007). Multilevel IRT modeling in practice with the package mlirt. Journal of Statistical Software, 20(5), 1–16.
Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York: Springer.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1, 515–534.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis (2nd ed.). London: Chapman & Hall.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y.-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2, 1360–1383.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457–511.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Glas, C. A. W., & Meijer, R. R. (2003). A Bayesian approach to person fit analysis in item response theory models. Applied Psychological Measurement, 27, 217–233.
Hall, B. (2012). Bayesian inference. CRAN. R package version 12.07.02. URL: http://cran.r-project.org/web/packages/LaplacesDemon/index.html
Heath, T. A. (2012). A hierarchical Bayesian model for calibrating estimates of species divergence times. Systematic Biology, 61, 793–809. doi:10.1093/sysbio/sys032
Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer-Verlag.
Kim, S.-H., Cohen, A. S., Baker, F. B., Subkoviak, M. J., & Leonard, T. (1994). An investigation of hierarchical Bayes estimation in item response theory. Psychometrika, 59, 405–421.
Kwok, H., & Lewis, R. J. (2011). Bayesian hierarchical modeling and the integration of heterogeneous information on the effectiveness of cardiovascular therapies. Circulation: Cardiovascular Quality and Outcomes, 4, 657–666.
Lee, H. (1995). Markov chain Monte Carlo methods for estimating multidimensional ability in item response analysis. Unpublished doctoral dissertation, University of Missouri, Columbia, MO.
Lim, R. G., & Drasgow, F. (1990). Evaluation of two methods for estimating item response theory parameters when assessing differential item functioning. Journal of Applied Psychology, 75, 164–174.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.
Molenaar, I. W. (1995). Estimation of item parameters. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications (pp. 39–51). New York: Springer-Verlag.
Nylander, J. A., Wilgenbusch, J. C., Warren, D. L., & Swofford, D. L. (2008). AWTY (Are we there yet?): A system for graphical exploration of MCMC convergence in Bayesian phylogenetics. Bioinformatics, 24, 581–583.
Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.
Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.
Rupp, A. A., Dey, D. K., & Zumbo, B. D. (2004). To Bayes or not to Bayes, from whether to when: Applications of Bayesian methodology to modeling. Structural Equation Modeling, 11, 424–451.
Sahu, S. K. (2002). Bayesian estimation and model choice in item response models. Journal of Statistical Computation and Simulation, 72, 217–232.
Sheng, Y. (2008). Markov chain Monte Carlo estimation of normal ogive IRT models in MATLAB. Journal of Statistical Software, 25(8), 1–15.
Sheng, Y. (2010). A sensitivity analysis of Gibbs sampling for 3PNO IRT models: Effect of priors on parameter estimates. Behaviormetrika, 37, 87–110.
Sheng, Y., & Headrick, T. C. (2012). A Gibbs sampler for the multidimensional item response model. ISRN Applied Mathematics, Article 269385, 1–14.
Sheng, Y., & Wikle, C. K. (2007). Comparing multiunidimensional and unidimensional item response theory models. Educational and Psychological Measurement, 67, 899–919.
Sheng, Y., & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical structure. Educational and Psychological Measurement, 68, 413–430.
Sheng, Y., & Wikle, C. K. (2009). Bayesian IRT models in incorporating general and specific abilities. Behaviormetrika, 36, 27–48.
Smith, A. F. M., & Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 55, 3–23.
Spiegelhalter, D. J., Thomas, A., & Best, N. G. (1996). Computation on Bayesian graphical models. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 5 (pp. 407–425). Oxford: Oxford University Press.
Swaminathan, H., & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational Statistics, 7, 175–192.
Swaminathan, H., & Gifford, J. A. (1983). Estimation of parameters in the three-parameter latent trait model. In D. Weiss (Ed.), New horizons in testing (pp. 13–30). New York: Academic Press.
Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349–364.
Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 581–601.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528–550.
Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics, 22, 1701–1762.
Tsutakawa, R. K., & Johnson, J. C. (1990). The effect of uncertainty of item parameter estimation on ability estimates. Psychometrika, 55, 371–390.
Tsutakawa, R. K., & Soltys, M. J. (1988). Approximation for Bayesian ability estimation. Journal of Educational Statistics, 13, 117–130.
Williams, C. L., & Locke, A. (2003). Hyperprior imprecision in hierarchical Bayesian modeling of cluster Bernoulli observations. InterStat: Statistics on the Internet. URL: http://interstat.statjournals.net/YEAR/2003/abstracts/0310001.php

(Received November 17, 2011; Revised December 3, 2012)
