MIRT Software Review Kyung (Chris)

Running Head: MIRT Software Review

A Review of Commercial Software Packages for Multidimensional IRT Modeling

Kyung (Chris) T. Han Graduate Management Admission Council & Insu Paek Florida State University

Preferred Citation: Han, K. T., & Paek, I. (2014). A review of commercial software packages for multidimensional IRT Modeling. Applied Psychological Measurement, 38(6), 486-498.

Correspondence may be sent to: Kyung T. Han Graduate Management Admission Council 11921 Freedom Dr. Suite 300, Reston, VA 20190 [email protected]

The views and opinions expressed in this article are those of the authors and do not necessarily reflect those of the Graduate Management Admission Council®.

Abstract

In this study, the authors evaluate several commercially available multidimensional IRT (MIRT) software packages, including IRTPRO 2.1, Mplus 7.1, flexMIRT, and EQSIRT, as well as their built-in estimation algorithms, and compare their performance in MIRT model estimation. The study examines model parameter recovery via a series of simulations based on four approaches to latent structuring: within-item MIRT, between-item MIRT, a mixture of within- and between-item MIRT, and a bifactor model. The simulation studies focused on realistic conditions and models that researchers and practitioners are likely to encounter in practice. The results showed that the studied software packages recovered the item parameters reasonably well but differed greatly in the types of data and models they could handle and in the run time required to complete estimation.

Keywords: MIRT, Computer Program, Simulation

Author Contact Information

Kyung (Chris) T. Han

Graduate Management Admission Council 11921 Freedom Dr. Suite 300, Reston, VA 20190 [email protected] [email protected]

Insu Paek Educational Psychology & Learning Systems Florida State University 3204D Stone Building 1114 W. Call St. Tallahassee, FL 32306-4453 (Tel) 850-644-3064 [email protected]


A Review of Commercial Software Packages for Multidimensional IRT Modeling¹ ²

Introduction

The emergence of item response theory (IRT), originally introduced in the 1960s and based mainly on a unidimensional latent structure (Lord & Novick, 1968), completely changed the paradigm of psychological and educational measurement. It made possible many applications, computerized adaptive testing for example, that were far less effective (if not nearly impossible) under classical test theory. One of the main reasons for IRT’s rapid gain in popularity in the field was the availability of computer software tools for estimating various IRT models. This development enabled and invited a wide range of IRT research and applications.

With the introduction of early unidimensional IRT models to the field, models generalized for multidimensional latent structures soon followed. The concept of multidimensional IRT (MIRT), the mathematical equivalent of existing factor analysis approaches, had circulated in the field for decades but had yet to reach critical mass in terms of the number of related research studies and applications. This was largely due to the unavailability of computer software tools for MIRT modeling and to insufficient PC computing power. Typical response data in psychological or educational measurement are on a discrete scale, so estimation requires integration over the latent dimensions, which causes heavy computational loads, especially with three or more dimensions.

The last five years, however, have witnessed great progress in the development of both computer hardware and software. It is now common for PCs to sport multiple CPU cores, and equally common for these newer systems to implement multiple parallel threads with a virtualization technique for computation-heavy tasks. Most important is the development of several new software tools for MIRT modeling designed to take advantage of the latest PCs that are now readily available to the public. Because these MIRT software tools differ significantly in performance and user experience, researchers and practitioners could benefit from a comprehensive comparison when choosing the right MIRT software tool for their specific needs.

This study presents an evaluation and comparison of the most recent commercially available MIRT software packages and their estimation algorithms. These packages are IRTPRO 2.1 (Cai, Thissen, & du Toit, 2011), Mplus 7.1 (Muthén & Muthén, 1998–2012), flexMIRT 2 (Cai, 2013), and EQSIRT (Wu & Bentler, 2013).

¹ The views and opinions expressed in this article are those of the authors and do not necessarily reflect those of the Graduate Management Admission Council® (GMAC®).

² The authors are grateful to Paula Bruggeman of GMAC® for editorial review.

Features, Capabilities, and Algorithms

Mplus 7.1

Mplus supports a variety of statistical analysis methods including regression and path analysis, exploratory factor analysis (EFA), confirmatory factor analysis (CFA), structural equation modeling (SEM), and mixture modeling. It also supports multigroup and multilevel data. Mplus handles IRT and MIRT models as a special case of CFA and can estimate threshold and slope parameters but not the lower asymptote (i.e., the pseudo-guessing parameter). In other words, it does not support 3-parameter logistic (3PL) IRT/MIRT models. Mplus can handle dichotomous and polytomous items and also supports mixture and multilevel IRT analyses. For MIRT modeling, Mplus offers great flexibility in specifying latent structures and constraints. Users are allowed to impose constraints on the covariance matrix for latent variables and on item parameters. Mplus can handle data with missing responses.

For MIRT estimation, the available estimators include weighted least squares (WLSMV), maximum likelihood (ML), and Bayes. For the ML estimator, which is the most widely used, Mplus features several different sets of algorithms including Fisher scoring, Newton-Raphson, Quasi-Newton, and Expectation-Maximization (EM). By default, Mplus automatically selects and implements different algorithms for ML during the iterations. Users can choose one of three options for numerical integration: (1) rectangular integration, (2) Gauss-Hermite integration, and (3) Monte Carlo integration. The developers recommend rectangular integration whenever possible, but Monte Carlo integration is sometimes the only feasible option when the estimation involves more than three dimensions, because of what is known as the “curse of dimensionality.” For estimating person scores (i.e., factor scores), Mplus uses the expected a posteriori (EAP) estimation method. Monte Carlo simulations and multiple imputations can be done within the software. Mplus supports multithreading for the ML estimator, which can reduce the time required to complete estimation by a large factor when the computer has multiple CPU cores available.
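As a rough illustration of the “curse of dimensionality,” the sketch below counts the nodes that a fixed-grid rule (rectangular or Gauss-Hermite) must evaluate when the same number of nodes is used on every latent dimension. The function name and the choice of 15 nodes per dimension are illustrative assumptions, not defaults of Mplus or of any other package reviewed here.

```python
def quadrature_grid_size(points_per_dim: int, n_dims: int) -> int:
    """Number of nodes in a full tensor-product quadrature grid."""
    return points_per_dim ** n_dims


# Assumed 15 nodes per dimension (illustrative, not a package default).
for n_dims in range(1, 6):
    print(n_dims, quadrature_grid_size(15, n_dims))
```

Under this assumption, a three-dimensional model requires 3,375 nodes and a four-dimensional model 50,625, which is why sampling-based integration becomes attractive as dimensions are added.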

IRTPRO 2.1

IRTPRO 2.1, although still offering some features for EFA, mainly focuses on IRT/MIRT modeling. IRTPRO is based on a highly generalized IRT model that embraces multiple groups, multiple response categories, and multiple dimensions. As a result, it can cover a range of models, from a unidimensional 3PL model with a pseudo-guessing parameter to multidimensional IRT models for multiple groups with polytomous responses based on the graded response model (GRM), the generalized partial credit model (GPCM), or the nominal response model (NRM). Users can impose constraints or prior values on any of the item parameters with IRTPRO, and the software can handle data with missing responses. IRTPRO sets no maximum on the number of items it can handle, although in practice the number is limited by the system’s usable memory and by the maximum memory IRTPRO can manage with its 32-bit architecture (theoretically up to 4 GB).

For item parameter estimation, IRTPRO offers three different methods: the Bock-Aitkin approach with the expectation-maximization algorithm (BAEM; Bock & Aitkin, 1981), the adaptive quadrature approach (ADQ; Schilling & Bock, 2005) with three different options for numerical integration (Gauss-Hermite, Monte Carlo, and Latin Hypercube), and the Metropolis-Hastings Robbins-Monro method (MHRM; Cai, 2010a, 2010b). MHRM is an optimization method that produces marginal maximum likelihood (MML) and modal Bayes solutions for item factor models and multilevel item factor models. It eschews numerical integration and combines elements of Markov chain Monte Carlo (MCMC), using the Metropolis-Hastings method, with stochastic approximation (the Robbins-Monro method) to achieve a pointwise convergent algorithm. It produces standard errors as a by-product. Generally, the only tuning parameter for the MHRM algorithm is the proposal dispersion constant for the MH sampler.

For score estimation, IRTPRO supports three methods: EAP, summed score EAP (SSEAP), and maximum a posteriori (MAP). It also provides differential item functioning (DIF) analysis tools. IRTPRO supports multithreading, which boosts its computation speed on computers that have multiple CPU cores.
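The stochastic approximation ingredient of MHRM can be illustrated with a toy Robbins-Monro recursion. The sketch below is not the MHRM algorithm and does not reflect IRTPRO’s implementation: it solves the simple equation E[X − θ] = 0 from noisy draws using the decreasing gain sequence 1/k, and the function name, the normal draws, and the seed are all illustrative assumptions.

```python
import random


def robbins_monro_mean(draws, theta0=0.0):
    """Solve E[X - theta] = 0 for theta with gains gamma_k = 1/k."""
    theta = theta0
    for k, x in enumerate(draws, start=1):
        gamma = 1.0 / k               # decreasing gain sequence
        theta += gamma * (x - theta)  # noisy step toward the root
    return theta


rng = random.Random(0)  # fixed seed, for reproducibility only
draws = [rng.gauss(1.5, 1.0) for _ in range(5000)]
estimate = robbins_monro_mean(draws)
```

With this particular gain sequence the recursion reproduces the running sample mean, which makes the averaging role of the decreasing gains easy to see. In MHRM, the noisy quantity being averaged comes from Metropolis-Hastings draws of the latent traits rather than from direct sampling.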

flexMIRT 2.0

flexMIRT supports the same unidimensional and multidimensional IRT models that IRTPRO does and is also capable of handling multilevel structures; the latest version (2.0) also supports extended cognitive diagnostic models (CDM). There are two editions of flexMIRT for different Windows systems, one based on a 32-bit architecture and another based on a 64-bit architecture. For the 64-bit version, there is practically no limit on the amount of computer memory that flexMIRT can use (Windows 8 Pro supports up to 512 GB of RAM).

For item calibration, users can choose either the BAEM or the MHRM method, and for score estimation they have the option of using ML, EAP, SSEAP, weighted SSEAP, or MAP. flexMIRT can also estimate the item parameters and the theta distribution simultaneously by using an empirical histogram (currently only for single-level bifactor or testlet response models). Tools for DIF analyses, multiple imputation, and simulations are also available within flexMIRT. A unique feature of flexMIRT is “fixed effects calibration,” in which item parameters are estimated given fixed values for individual thetas. This feature can be very useful for practitioners who need to calibrate pretest items using previously calibrated individual theta values. flexMIRT also supports multithreading, with two options for efficiency tuning depending on the number of dimensions and the number of items.

EQSIRT 1.0

EQSIRT supports various unidimensional and multidimensional IRT models with both dichotomous and polytomous response types, as well as latent class analysis (LCA) and Mokken scale analysis (MSA). EQSIRT offers three estimation algorithms for item calibration: (1) the marginal maximum likelihood (MML) method, (2) the Monte Carlo Expectation-Maximization (MCEM) method, and (3) the MCMC method. Not all estimation methods work with all IRT models, however. For example, MCEM and MCMC do not work for estimating the 3PL model in EQSIRT. For theta estimation, the software offers the ML, EAP, and MAP methods. EQSIRT offers a wide variety of tools for IRT-related research, including DIF analysis, test score equating, and simulation. EQSIRT cannot handle a response matrix with missing data and does not support multithreading. EQSIRT can handle a maximum of 200 items for MIRT calibration.

Performance

A series of simulation studies was conducted to evaluate and compare the performance of MIRT parameter recovery across the studied software packages.

Simulations

Response data for 30 dichotomous items loading on three or four factors (i.e., latent variables) were simulated based on the multidimensional compensatory 2PL IRT model under four different latent structures. The diagrams of the four models are presented in Figure 1. Model 1 was a so-called “between-item” structure, in which each item loaded on a single factor. There were three factors, and each factor was exclusively associated with 10 items. The true item parameter values were borrowed from a real, existing item bank for a testing program in higher education. The mean and standard deviation (SD) of the true slope parameter values of the 30 items were 1.22 and 0.48, respectively. To prevent potential scaling issues, all true threshold parameter values were rescaled to follow a standard normal distribution.
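The data-generating step for a compensatory 2PL model can be sketched as follows. This is a minimal illustration, not the WinGen procedure used in the study. It assumes a slope-threshold parameterization in which the logit is the sum of slope-by-theta products minus the threshold; the three item parameter sets are hypothetical, and the thetas are drawn as independent standard normals for simplicity, whereas the study used correlated thetas from real score data.

```python
import math
import random


def prob_correct(slopes, threshold, theta):
    """Compensatory M2PL: logistic(sum_k a_k * theta_k - threshold)."""
    logit = sum(a * t for a, t in zip(slopes, theta)) - threshold
    return 1.0 / (1.0 + math.exp(-logit))


def simulate_responses(items, thetas, rng):
    """items: list of (slopes, threshold); thetas: list of ability vectors."""
    return [[1 if rng.random() < prob_correct(a, tau, th) else 0
             for a, tau in items]
            for th in thetas]


rng = random.Random(2014)  # fixed seed, for reproducibility only
# Hypothetical between-item pattern: each item loads on one of three factors.
items = [([1.2, 0.0, 0.0], -0.5),
         ([0.0, 0.9, 0.0], 0.0),
         ([0.0, 0.0, 1.5], 0.7)]
thetas = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
data = simulate_responses(items, thetas, rng)
```

A zero slope removes a factor from an item, so the same function covers between-item, within-item, and bifactor patterns; only the slope vectors change. The term “compensatory” reflects the additive logit: a low value on one dimension can be offset by a high value on another.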


Insert Figure 1 About Here

Model 2 was a so-called “within-item” structure, in which some items loaded on more than a single factor. In Model 2, the first 10 items loaded only on the first factor, F1, and Items 11 to 20 loaded on both F1 and F2. The final 10 items (Items 21 to 30) loaded on all three factors (F1, F2, and F3). The mean and SD of the true a-parameter values across factors were 0.88 and 0.47, respectively, and the true b-parameter values were the same as those used with Model 1. Model 3 was a mixture of “between” and “within” structures. Items 1 to 10 loaded only on F1, Items 11 to 15 only on F2, and Items 21 to 25 only on F3. The remaining items loaded on multiple factors. Model 4 was a bifactor model in which there was one primary factor (F1) on which all 30 items loaded. Each item also loaded on one of three nuisance factors (F2, F3, and F4).

Under each of the four studied models, response data for 3,000 simulees were generated. For Models 1, 2, and 3, the true theta values with three dimensions were borrowed from real score data derived from a higher education testing program with three different subjects. The true theta values of each dimension were rescaled to follow a standard normal distribution, and as a result, the true covariance matrix values were identical to the correlation matrix values. The covariance values were 0.19 for F1 and F2, 0.46 for F1 and F3, and 0.35 for F2 and F3. For Model 4, the true theta values for four factors (1 main + 3 nuisance) were randomly generated from a standard normal distribution; in the covariance matrix, all diagonal values were 1 and all off-diagonal values were 0.

In addition to the aforementioned study conditions, Models 1, 2, and 4 were also replicated with missing response data. For the missing data conditions, the same 3,000 simulees were used. Each simulee was administered 30 items randomly selected from a total of 90 items.
The simulation was performed using WinGen software (Han, 2007).

Calibration

Most of the available estimation methods in Mplus, IRTPRO, flexMIRT, and EQSIRT were used. For Mplus, the ML and MC methods were used. With IRTPRO, the BAEM, ADQ, and MHRM methods were used, and the BAEM and MHRM methods were employed with flexMIRT. The MML, MCEM, and MCMC options were used with EQSIRT. All item parameters (slope and threshold) that loaded on corresponding factors according to each model design (Figure 1) were set to be freely estimated. The slope parameters of items that did not load on a factor were fixed at zero. To avoid scale indeterminacy, the means of the latent trait distributions were set to 0 and the variances were fixed at 1. For Models 1, 2, and 3, all off-diagonal values of the covariance matrix for the latent traits were set to be freely estimated. For Model 2, the calibration was run a second time under an additional condition in which all off-diagonal values of the covariance matrix were fixed to the true values (Model 2B); the condition with freely estimated covariances is referred to as Model 2A. For Model 4, all off-diagonal values of the covariance matrix were fixed to zero during calibration, and for IRTPRO and flexMIRT, the commands for bifactor modeling were used, which employ dimension reduction techniques.

The calibration performance of the studied software programs was evaluated based on Pearson correlation coefficients between the true values and estimates, estimation bias, and mean absolute error (MAE) statistics. For models where the covariance matrix was freely estimated, the true and estimated covariance values were also compared. All calibrations were run on a computer with a quad-core Intel i7-2760QM CPU and 8 GB of physical RAM. The operating system was Microsoft Windows 7 Professional 64-bit edition. Except for EQSIRT, which did not support multithreading, all studied programs were set to use up to two CPU cores during calibration.
The actual elapsed time (not CPU time) for each calibration run was recorded. The study focused on evaluating item parameter recovery and did not evaluate the person estimates.

Results and Performance Comparison

Conditions with 30 items without missing responses were evaluated first. Except for the MCEM and MCMC methods with EQSIRT, all studied software programs with different estimation methods showed an extremely high level of parameter recovery for both slope and threshold parameters under Models 1, 2B, 3, and 4. The correlation between the true and estimated parameter values was higher than 0.96 for slope and higher than 0.99 for threshold. EQSIRT with the MCEM and MCMC methods failed to finish calibration runs for all studied models.³ As shown in Table 1, the estimation bias and errors (measured using MAE) were very small for Models 1, 2B, and 3. The off-diagonal values of the covariance matrix were reasonably well recovered for Models 1 and 3.
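The three recovery statistics reported in Table 1 (correlation, bias, and MAE) can be computed along the following lines. This is a minimal sketch rather than the authors’ code; it assumes that bias is defined as the mean of estimate minus true value, a sign convention the article does not state, and the parameter values in the example are made up.

```python
import math


def pearson_r(true, est):
    """Pearson correlation between true values and estimates."""
    n = len(true)
    mean_t, mean_e = sum(true) / n, sum(est) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true, est))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(var_t * var_e)


def bias(true, est):
    """Mean signed error (assumed convention: estimate minus true)."""
    return sum(e - t for t, e in zip(true, est)) / len(true)


def mae(true, est):
    """Mean absolute error."""
    return sum(abs(e - t) for t, e in zip(true, est)) / len(true)


# Made-up slope values, for illustration only.
true_slopes = [0.5, 1.0, 1.5, 2.0]
est_slopes = [0.6, 1.1, 1.4, 2.2]
summary = (pearson_r(true_slopes, est_slopes),
           bias(true_slopes, est_slopes),
           mae(true_slopes, est_slopes))
```

Bias and MAE answer different questions: errors of opposite sign cancel in the bias but not in the MAE, so a near-zero bias can coexist with a sizable MAE.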

Insert Table 1 About Here

Model 2 (the within-item design) was technically a hierarchical factor model with three layers, and the latent covariances were expected to be unidentifiable under the studied conditions. Contrary to our expectation, however, all studied software tools finished calibration runs without encountering any errors when the latent covariance matrix was freely estimated (Model 2A). As reported in Table 1, however, the estimated latent covariance values from all calibration runs were incorrect because of rotational indeterminacy. In these circumstances, the software tools should have either stopped running without producing output containing wrong estimates for unidentifiable parameters or, at the very least, provided error or warning messages so that users could avoid being misled by the meaningless estimates. None of the studied programs, however, provided such a feature or guidance. The accuracy of slope parameter estimates under Model 2A across the studied programs was moderately degraded from that observed for Model 1. This was due mainly to increased estimation bias resulting from the incorrect covariance estimates. The threshold estimates under Model 2A were still very close to the true values for all studied programs. Under Model 2B, where all covariance matrix values were fixed to the true values, all programs except EQSIRT showed an extremely high level of parameter recovery for both slope and threshold. EQSIRT offers a syntax command to fix latent covariance matrix values, but the command did not work; the program recognized the command but ignored it. As a result, the latent covariance matrix values for Model 2B were not fixed but freely estimated with EQSIRT, as in Model 2A.⁴

³ At the time of this writing, the developers of EQSIRT are aware of the issue and reportedly are working to fix the problems.

Item calibration performance in the presence of missing data was also evaluated, and the results are reported in Table 2. Only Models 1, 2B, and 4 were studied, using a total of 90 items, although each simulee was administered only 30 items chosen at random. Therefore, with about 1,000 observed responses per item, the sparseness of the response matrix (i.e., the percentage of missing responses) was about 66.7%. Although it showed options for handling missing data in its graphical interface, EQSIRT could not handle missing data for IRT calibration; all other programs ran successfully. As shown in Table 2, all studied programs except EQSIRT showed correlations between the true and estimated values that were higher than 0.99 for threshold and 0.93 for slope, slightly lower than in the non-missing data conditions shown in Table 1. There was practically no bias in estimating the slopes and thresholds when the data had missing responses, and the estimation errors (MAE) were slightly larger than those observed in Table 1. This was not unexpected given the decrease in the number of responses available for each item (3,000 responses under the non-missing data conditions, compared with about 1,000 responses under the missing data conditions).
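The arithmetic behind the missing data design (30 of 90 items per simulee, hence roughly two thirds of the response matrix missing and about 1,000 responses per item) can be checked with a short sketch. The helper name and the seed are illustrative assumptions; the study’s actual item assignment was carried out with WinGen and is not reproduced here.

```python
import random


def administer_random_subset(n_items_total, n_items_given, rng):
    """Return the set of item indices administered to one simulee."""
    return set(rng.sample(range(n_items_total), n_items_given))


n_simulees, n_total, n_given = 3000, 90, 30
rng = random.Random(7)  # fixed seed, for reproducibility only

# Count how many simulees responded to each item.
responses_per_item = [0] * n_total
for _ in range(n_simulees):
    for item in administer_random_subset(n_total, n_given, rng):
        responses_per_item[item] += 1

sparseness = 1 - n_given / n_total                     # share of missing cells
expected_per_item = n_simulees * n_given / n_total     # average responses per item
```

Because each simulee sees exactly 30 items, the sparseness is fixed by design at two thirds, while the number of responses per item varies randomly around 1,000.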

Insert Tables 2 and 3 About Here

The elapsed time for item calibration varied greatly across the studied programs and estimation methods, as shown in Table 3. Each program has different default settings and tunings for each of the studied estimation methods (for example, different maximum numbers of cycles for iterative calibration processes, different convergence criteria, and different numbers of quadrature points). Direct comparisons of the elapsed times under each program’s default convergence settings may therefore not be a critically meaningful measure of program performance, but they still offer helpful information. Among the compared programs, EQSIRT with the MML estimation method took the longest time across all studied conditions, likely because EQSIRT was the only one of the studied programs that did not support multithreading. The BAEM method implemented within IRTPRO and flexMIRT tended to take more time than the similar ML method implemented in Mplus. The MHRM method with IRTPRO achieved the shortest elapsed time (less than 4 minutes) in all studied conditions and models. The same MHRM method implemented within flexMIRT tended to take slightly more time than in IRTPRO. The MC method of Mplus was also very fast (less than 10 minutes) in all studied conditions. The ADQ method within IRTPRO was the fastest (under 19 minutes) among the non-MC-based estimation methods.

⁴ This bug was reported to the software developers.

User Interface and Documentation

Although Mplus features a basic syntax generator via dialog boxes, the main user interface (UI) of Mplus is not much more than a text editor, in which users directly write syntax commands and edit input files. Mplus supports various text-based data formats and can be run easily in batch mode. Most outputs (except for the factor scores, i.e., theta estimates) are saved in a single file, which sometimes makes it difficult to extract important results such as item parameter estimates. Error messages in Mplus are usually sufficiently detailed to enable users to locate problems. Mplus also provides useful warning notes in the output when users may need to use caution in interpreting the results.

IRTPRO offers a complete point-and-click UI as well as a text editor for editing syntax commands. IRTPRO supports only its own proprietary data file format (*.ssig). Although IRTPRO provides features for importing data from other file formats, the inability to work directly on various file formats can limit the usability of the software, especially when it is set to run in batch mode for massive data analyses. For outputs, IRTPRO allows users to save several key results in separate files. The main output is presented in a well-organized html format.

Like IRTPRO, EQSIRT has a complete point-and-click UI. The main interface of EQSIRT is well organized and highly flexible, so users can easily work on multiple projects simultaneously. The syntax editor is very user-friendly, with automatic color coding of the syntax commands. EQSIRT has its own proprietary format for data files, but users can still directly access fixed-format data without converting them. The main output is provided in both text and html formats, with xml format also available. Error messages provided in dialog boxes did not seem to be helpful in identifying or fixing problems in syntax and data. During this study, the program often stopped running without displaying any error messages.

Similar to Mplus, flexMIRT has a mostly syntax-based interface with no real graphical user interface elements. flexMIRT has no feature for generating syntax commands; users must write syntax commands on their own. The program can natively handle data in space-, comma-, or tab-delimited formats but does not support fixed format. Users have options to save key results in separate output files. The program provides simple error messages that were useful for identifying and fixing problems with syntax and data.

All studied programs come with user manuals and output examples. Mplus offers the most comprehensive manual, including hundreds of examples for most of the models and analyses that the program is designed to handle, but the content covering IRT modeling is very limited. Mplus has a large user forum where thousands of examples, discussions, and Q&A can be found. The software’s technical support is also strong; most questions are answered within 24 hours. The other programs (IRTPRO, EQSIRT, and flexMIRT) have much shorter histories in the field and offer fewer resources beyond the examples provided in their manuals and technical support.

Availability and Price

Mplus and EQSIRT are available for Windows, Mac OS X, and Linux platforms; IRTPRO and flexMIRT are available only for Windows systems. IRTPRO must be run with administrator privileges every time; otherwise, it does not start. This can be a serious limitation for users who need to work on company-owned and maintained computers, because most users in such work environments do not normally have administrator privileges. Versions with the 64-bit architecture for 64-bit Windows systems are available for Mplus and flexMIRT.

Program prices vary significantly. For academic users, a single-user license costs $595 for Mplus, $495 for IRTPRO (for a single installation), and $595 for EQSIRT. Student versions of Mplus and EQSIRT are available for less than $200. flexMIRT does not offer a permanent license option; instead, it is a subscription-based product costing $125 per year (up to three installations) for academic users. All prices are subject to change. All programs also offer a wide range of pricing options for rental and commercial use, as well as trial versions with limited time use or limited features.

Conclusion

All MIRT software packages that were evaluated and compared in this study are capable of estimating MIRT models. The results from the parameter recovery study with simulation suggest that they all worked very well with a high level of estimation accuracy (except for Model 2A, where the covariance matrix for latent variables was not identifiable unless additional constraints were imposed). Each program, however, has its own unique favorable features and shortcomings. Regarding limitations of the programs evaluated here, Mplus cannot calibrate models with lower asymptote parameters (i.e., pseudo-guessing). IRTPRO cannot natively handle text-based data formats, EQSIRT cannot handle data with missing responses, and flexMIRT has a syntax-only user interface, which may not be easy for new learners. Therefore, it is important for users to understand the different advantages and limitations across the programs before they choose a MIRT software package for their purposes.

References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443–459.

Cai, L. (2010a). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins-Monro algorithm. Psychometrika, 75, 33–57.

Cai, L. (2010b). Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis. Journal of Educational and Behavioral Statistics, 35, 307–335.

Cai, L. (2013). flexMIRT: A numerical engine for flexible multilevel multidimensional item analysis and test scoring (Version 2.0) [Computer software]. Chapel Hill, NC: Vector Psychometric Group.

Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.

Han, K. T. (2007). WinGen: Windows software that generates IRT parameters and item responses. Applied Psychological Measurement, 31, 457–459.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley Publishing Company.

Muthén, L. K., & Muthén, B. O. (1998–2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén.

Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature. Psychometrika, 70, 533–555.

Wu, E. J. C., & Bentler, P. M. (2013). EQSIRT: A comprehensive item response theory program [Computer software]. Encino, CA: Multivariate Software, Inc.

MIRT Software Review 17 Table 1. Item Parameter Recovery When Without Missing Data (30 Items / 3,000 Simulees) Correlation

Model 1 (Between)

Model 2A2) (Within)

`Model 2B3) (Within)

Model 3 (between + within)

Model 4 (Bi-Factor)

Bias

MAE

Slope

Threshold

Slope

Threshold

Slope

Threshold

Mplus (ML)

0.986

0.998

0.041

0.016

0.070

0.054

Covariance Matrix Cov(a1,a2) Cov(a1,a3) Cov(a3,a3) True=0.19 True=0.46 True=0.35 0.217 0.456 0.344

Mplus (MC)

0.986

0.998

0.048

0.017

0.073

0.054

0.227

0.473

0.351

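| Model | Program (Estimation) | Correlation: Slope | Correlation: Threshold | Bias: Slope | Bias: Threshold | MAE: Slope | MAE: Threshold | Cov(a1,a2) | Cov(a1,a3) | Cov(a2,a3) |
|---|---|---|---|---|---|---|---|---|---|---|
| Model 1 | EQSIRT (ML) | 0.986 | 0.998 | 0.041 | 0.016 | 0.070 | 0.054 | 0.217 | 0.457 | 0.344 |
| Model 1 | IRTPRO (BAEM) | 0.986 | 0.998 | 0.041 | 0.016 | 0.070 | 0.054 | 0.217 | 0.457 | 0.344 |
| Model 1 | IRTPRO (ADQ) | 0.986 | 0.998 | 0.041 | 0.014 | 0.070 | 0.053 | 0.217 | 0.457 | 0.344 |
| Model 1 | IRTPRO (MHRM) | 0.986 | 0.998 | 0.039 | 0.021 | 0.070 | 0.054 | 0.221 | 0.461 | 0.356 |
| Model 1 | FlexMIRT (BAEM) | 0.986 | 0.998 | 0.041 | 0.016 | 0.070 | 0.054 | 0.265 | 0.476 | 0.367 |
| Model 1 | FlexMIRT (MHRM) | 0.986 | 0.998 | 0.038 | 0.017 | 0.069 | 0.055 | 0.213 | 0.432 | 0.330 |
| Model 2A 2) | Mplus (ML) | 0.913 | 0.998 | 0.157 | 0.041 | 0.196 | 0.067 | -0.023 | 0.160 | -0.225 |
| Model 2A 2) | Mplus (MC) | 0.960 | 0.998 | 0.082 | 0.037 | 0.127 | 0.065 | 0.155 | 0.276 | -0.125 |
| Model 2A 2) | EQSIRT (ML) | 0.930 | 0.998 | 0.123 | 0.041 | 0.159 | 0.068 | 0.064 | 0.523 | -0.340 |
| Model 2A 2) | IRTPRO (BAEM) | 0.888 | 0.998 | 0.192 | 0.041 | 0.231 | 0.067 | -0.072 | 0.085 | -0.351 |
| Model 2A 2) | IRTPRO (ADQ) | 0.822 | 0.998 | 0.263 | 0.038 | 0.302 | 0.066 | -0.228 | 0.006 | -0.281 |
| Model 2A 2) | IRTPRO (MHRM) | 0.905 | 0.998 | 0.174 | 0.005 | 0.211 | 0.056 | -0.025 | 0.058 | -0.336 |
| Model 2A 2) | FlexMIRT (BAEM) | 0.888 | 0.998 | 0.192 | 0.041 | 0.231 | 0.068 | 0.240 | -0.160 | -0.232 |
| Model 2A 2) | FlexMIRT (MHRM) | 0.903 | 0.998 | 0.163 | 0.013 | 0.206 | 0.059 | 0.077 | 0.011 | 0.003 |
| Model 2B 3) | Mplus (ML) | 0.980 | 0.998 | 0.036 | 0.041 | 0.086 | 0.068 | - | - | - |
| Model 2B 3) | Mplus (MC) | 0.976 | 0.998 | 0.037 | 0.037 | 0.090 | 0.065 | - | - | - |
| Model 2B 3) | EQSIRT (ML) 1) | 0.930 | 0.998 | 0.123 | 0.041 | 0.159 | 0.068 | 0.064 1) | 0.523 1) | -0.340 1) |
| Model 2B 3) | IRTPRO (BAEM) | 0.980 | 0.998 | 0.036 | 0.041 | 0.086 | 0.067 | - | - | - |
| Model 2B 3) | IRTPRO (ADQ) | 0.980 | 0.998 | 0.037 | 0.037 | 0.086 | 0.066 | - | - | - |
| Model 2B 3) | IRTPRO (MHRM) | 0.977 | 0.998 | 0.042 | 0.005 | 0.088 | 0.056 | - | - | - |
| Model 2B 3) | FlexMIRT (BAEM) | 0.980 | 0.998 | 0.036 | 0.041 | 0.086 | 0.067 | - | - | - |
| Model 2B 3) | FlexMIRT (MHRM) | 0.978 | 0.998 | 0.041 | 0.009 | 0.084 | 0.058 | - | - | - |
| Model 3 | Mplus (ML) | 0.992 | 0.999 | 0.051 | 0.037 | 0.066 | 0.054 | 0.187 | 0.439 | 0.287 |
| Model 3 | Mplus (MC) | 0.991 | 0.998 | 0.052 | 0.040 | 0.070 | 0.056 | 0.215 | 0.492 | 0.305 |
| Model 3 | EQSIRT (ML) | 0.992 | 0.999 | 0.051 | 0.037 | 0.066 | 0.054 | 0.188 | 0.439 | 0.287 |
| Model 3 | IRTPRO (BAEM) | 0.992 | 0.999 | 0.051 | 0.037 | 0.066 | 0.054 | 0.187 | 0.439 | 0.287 |
| Model 3 | IRTPRO (ADQ) | 0.992 | 0.999 | 0.051 | 0.035 | 0.066 | 0.053 | 0.187 | 0.430 | 0.282 |
| Model 3 | IRTPRO (MHRM) | 0.992 | 0.999 | 0.051 | 0.033 | 0.067 | 0.052 | 0.190 | 0.413 | 0.288 |
| Model 3 | FlexMIRT (BAEM) | 0.992 | 0.999 | 0.051 | 0.037 | 0.066 | 0.054 | 0.187 | 0.438 | 0.286 |
| Model 3 | FlexMIRT (MHRM) | 0.992 | 0.999 | 0.058 | 0.033 | 0.072 | 0.053 | 0.181 | 0.341 | 0.215 |
| Model 4 | Mplus (ML) | 0.980 | 0.998 | 0.104 | 0.033 | 0.121 | 0.063 | | | |
| Model 4 | Mplus (MC) | 0.974 | 0.998 | 0.102 | 0.028 | 0.125 | 0.062 | | | |
| Model 4 | EQSIRT (ML) | 0.983 | 0.999 | -0.005 | -0.006 | 0.072 | 0.043 | | | |
| Model 4 | IRTPRO (BAEM) | 0.983 | 0.999 | -0.005 | -0.007 | 0.072 | 0.042 | | | |
| Model 4 | IRTPRO (ADQ) | 0.983 | 0.999 | -0.004 | -0.009 | 0.072 | 0.042 | | | |
| Model 4 | IRTPRO (MHRM) | 0.964 | 0.999 | 0.036 | -0.010 | 0.086 | 0.043 | | | |
| Model 4 | FlexMIRT (BAEM) | 0.983 | 0.999 | -0.005 | -0.007 | 0.072 | 0.042 | | | |
| Model 4 | FlexMIRT (MHRM) | 0.972 | 0.999 | 0.034 | -0.009 | 0.081 | 0.042 | | | |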

1) Did not work as intended: the covariance matrix values were set to be fixed, but EQSIRT did not fix them.

2) In Model 2A, the covariance matrix was freely estimated.

3) In Model 2B, the covariance matrix was not estimated but fixed to the true values.
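The Correlation, Bias, and MAE columns compare estimated item parameters with their generating (true) values. The following is a minimal sketch of how such recovery statistics might be computed, assuming bias is the mean of (estimate minus true) and MAE is the mean absolute difference. The function name `recovery_stats` and the example values are hypothetical and are not taken from the study.

```python
import math


def recovery_stats(true_vals, est_vals):
    """Return (correlation, bias, mae) between true and estimated parameters.

    Assumed definitions (not confirmed by the source):
      bias = mean(estimate - true)
      mae  = mean(|estimate - true|)
    """
    n = len(true_vals)
    if n != len(est_vals) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    mean_t = sum(true_vals) / n
    mean_e = sum(est_vals) / n

    # Pearson correlation between true and estimated values
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_vals, est_vals))
    var_t = sum((t - mean_t) ** 2 for t in true_vals)
    var_e = sum((e - mean_e) ** 2 for e in est_vals)
    corr = cov / math.sqrt(var_t * var_e)

    # Signed and absolute estimation errors
    errors = [e - t for t, e in zip(true_vals, est_vals)]
    bias = sum(errors) / n
    mae = sum(abs(x) for x in errors) / n
    return corr, bias, mae


# Hypothetical slope parameters for four items (illustration only)
true_slopes = [0.8, 1.0, 1.2, 1.5]
est_slopes = [0.9, 1.1, 1.3, 1.6]
corr, bias, mae = recovery_stats(true_slopes, est_slopes)
print(f"correlation={corr:.3f}, bias={bias:.3f}, MAE={mae:.3f}")
```

Under these definitions, a high correlation can coexist with a nonzero bias when every estimate is shifted in the same direction, which is why the tables report all three statistics.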

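Table 2. Item Parameter Recovery With Missing Data (30 Items / 3,000 Simulees / About 1,000 Responses per Item)

| Model | Program (Estimation) | Correlation: Slope | Correlation: Threshold | Bias: Slope | Bias: Threshold | MAE: Slope | MAE: Threshold | Cov(a1,a2) (True = 0.19) | Cov(a1,a3) (True = 0.46) | Cov(a2,a3) (True = 0.35) |
|---|---|---|---|---|---|---|---|---|---|---|
| Model 1 (Between) | Mplus (ML) | 0.948 | 0.996 | 0.042 | 0.004 | 0.110 | 0.068 | 0.221 | 0.448 | 0.349 |
| Model 1 (Between) | Mplus (MC) | 0.948 | 0.996 | 0.046 | 0.005 | 0.111 | 0.068 | 0.233 | 0.464 | 0.357 |
| Model 1 (Between) | IRTPRO (BAEM) | 0.948 | 0.996 | 0.042 | 0.004 | 0.110 | 0.068 | 0.220 | 0.450 | 0.350 |
| Model 1 (Between) | IRTPRO (ADQ) | 0.948 | 0.996 | 0.042 | 0.003 | 0.110 | 0.068 | 0.221 | 0.448 | 0.349 |
| Model 1 (Between) | IRTPRO (MHRM) | 0.947 | 0.996 | 0.042 | 0.003 | 0.111 | 0.068 | 0.222 | 0.445 | 0.351 |
| Model 1 (Between) | FlexMIRT (BAEM) | 0.948 | 0.996 | 0.042 | 0.004 | 0.110 | 0.068 | 0.221 | 0.448 | 0.349 |
| Model 1 (Between) | FlexMIRT (MHRM) | 0.949 | 0.996 | 0.040 | 0.004 | 0.108 | 0.067 | 0.214 | 0.422 | 0.332 |
| Model 2B (Within B) | Mplus (ML) | 0.932 | 0.997 | 0.033 | 0.006 | 0.125 | 0.069 | - | - | - |
| Model 2B (Within B) | Mplus (MC) | 0.933 | 0.997 | 0.028 | 0.001 | 0.124 | 0.070 | - | - | - |
| Model 2B (Within B) | IRTPRO (BAEM) | 0.932 | 0.997 | 0.033 | 0.007 | 0.125 | 0.069 | - | - | - |
| Model 2B (Within B) | IRTPRO (ADQ) | 0.932 | 0.997 | 0.033 | 0.004 | 0.125 | 0.069 | - | - | - |
| Model 2B (Within B) | IRTPRO (MHRM) | 0.931 | 0.997 | 0.035 | -0.002 | 0.125 | 0.069 | - | - | - |
| Model 2B (Within B) | FlexMIRT (BAEM) | 0.932 | 0.997 | 0.033 | 0.007 | 0.125 | 0.069 | - | - | - |
| Model 2B (Within B) | FlexMIRT (MHRM) | 0.932 | 0.997 | 0.033 | 0.000 | 0.126 | 0.069 | - | - | - |
| Model 4 (Bi-Factor) | Mplus (ML) | 0.933 | 0.997 | 0.028 | 0.000 | 0.121 | 0.067 | - | - | - |
| Model 4 (Bi-Factor) | Mplus (MC) | 0.934 | 0.997 | 0.025 | 0.001 | 0.122 | 0.066 | - | - | - |
| Model 4 (Bi-Factor) | IRTPRO (BAEM) | 0.933 | 0.997 | 0.028 | 0.000 | 0.121 | 0.067 | - | - | - |
| Model 4 (Bi-Factor) | IRTPRO (ADQ) | 0.933 | 0.997 | 0.028 | -0.002 | 0.121 | 0.067 | - | - | - |
| Model 4 (Bi-Factor) | IRTPRO (MHRM) | 0.930 | 0.997 | 0.029 | 0.001 | 0.122 | 0.067 | - | - | - |
| Model 4 (Bi-Factor) | FlexMIRT (BAEM) | 0.932 | 0.997 | 0.028 | -0.001 | 0.121 | 0.067 | - | - | - |
| Model 4 (Bi-Factor) | FlexMIRT (MHRM) | 0.931 | 0.997 | 0.029 | -0.003 | 0.122 | 0.067 | - | - | - |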

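Table 3. Elapsed Time for Item Calibration (Hour:Minute)

| Data | Condition | Mplus (ML) | Mplus (MC) | EQSIRT (MML) | IRTPRO (BAEM) | IRTPRO (ADQ) | IRTPRO (MHRM) | FlexMIRT (BAEM) | FlexMIRT (MHRM) |
|---|---|---|---|---|---|---|---|---|---|
| 3,000 Simulees, 30 Items, No Missing Data | Model 1 | 0:26 | 0:01 | 3:04 | 2:08 | 0:02 | 0:01 | 0:44 | 0:02 |
| 3,000 Simulees, 30 Items, No Missing Data | Model 2A | 0:14 | 0:06 | 9:38 | 3:48 | 0:04 | 0:03 | 1:05 | 0:04 |
| 3,000 Simulees, 30 Items, No Missing Data | Model 2B | 0:14 | 0:05 | 10:23 1) | 2:53 | 0:05 | 0:01 | 2:00 | 0:12 |
| 3,000 Simulees, 30 Items, No Missing Data | Model 3 | 0:07 | 0:01 | 3:16 | 2:30 | 0:03 | 0:01 | 1:26 | 0:03 |
| 3,000 Simulees, 30 Items, No Missing Data | Model 4 | 0:05 | 0:05 | 9:39 | 0:03 | 0:15 | 0:03 | 0:02 | 0:07 |
| 3,000 Simulees, 90 Items, 67% Missing Data | Model 1 | 0:25 | 0:02 | Not supported | 3:56 | 0:02 | 0:02 | 0:56 | 0:03 |
| 3,000 Simulees, 90 Items, 67% Missing Data | Model 2B | 0:37 | 0:09 | Not supported | 12:48 | 0:05 | 0:03 | 1:16 | 0:14 |
| 3,000 Simulees, 90 Items, 67% Missing Data | Model 4 | 0:13 | 0:05 | Not supported | 0:15 | 0:18 | 0:03 | 0:01 | 0:09 |

1) Did not work as intended: the covariance matrix values were set to be fixed, but EQSIRT did not fix them.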

Figure 1. Factor Structure and Items for Simulations

[Figure: four path diagrams showing how the items load on factors F1 to F4. Panels: Model 1, 3-Dimensional IRT (Between); Model 2, 3-Dimensional IRT (Within); Model 3, 3-Dimensional IRT (Between + Within); Model 4, 4-Dimensional Bi-Factor Model.]