PSYCHOMETRIKA
2012, DOI: 10.1007/s11336-012-9265-5
MULTIDIMENSIONAL CAT ITEM SELECTION METHODS FOR DOMAIN SCORES AND COMPOSITE SCORES: THEORY AND APPLICATIONS
LIHUA YAO

DEFENSE MANPOWER DATA CENTER, MONTEREY BAY

Multidimensional computer adaptive testing (MCAT) can provide higher precision and reliability or reduce test length when compared with unidimensional CAT or with the paper-and-pencil test. This study compared five item selection procedures in the MCAT framework for both domain scores and overall scores through simulation by varying the structure of item pools, the population distribution of the simulees, the number of items selected, and the content area. The existing procedures, such as Volume (Segall in Psychometrika, 61:331–354, 1996), Kullback–Leibler information (Veldkamp & van der Linden in Psychometrika 67:575–588, 2002), Minimize the error variance of the linear combination (van der Linden in J. Educ. Behav. Stat. 24:398–412, 1999), and Minimum Angle (Reckase in Multidimensional item response theory, Springer, New York, 2009), are compared to a new procedure, Minimize the error variance of the composite score with the optimized weight, proposed for the first time in this study. The intent is to find an item selection procedure that yields higher precisions for both the domain and composite abilities and a higher percentage of selected items from the item pool. The comparison is performed by examining the absolute bias, correlation, test reliability, time used, and item usage. Three sets of item pools are used with the item parameters estimated from real live CAT data. Results show that Volume and Minimum Angle performed similarly, balancing information for all content areas, while the other three procedures performed similarly, with high precision for both domain and overall scores when selecting items with the required number of items for each domain. The new item selection procedure has the highest percentage of item usage. Moreover, for the overall score, it produces similar or even better results compared to those from the method that selects items favoring the general dimension using the general model (Segall in Psychometrika 66:79–97, 2001); the general dimension method has low precision for the domain scores. In addition to the simulation study, the mathematical theories for certain procedures are derived. The theories are confirmed by the simulation applications.

Key words: BMIRT, CAT, domain scores, Kullback–Leibler, MCAT, multidimensional item response theory, multidimensional information, overall scores.
Multidimensional item response theory (MIRT) has a promising future in both the traditional paper-and-pencil (PP) administration format and the computer-adaptive format (CAT). MIRT, which borrows information across dimensions or domains, increases the precision and the reliability of the domain scores (Haberman & Sinharay, 2010; Yao & Boughton, 2007). Multidimensional computer-adaptive testing (MCAT) can provide higher precision and reliability or reduce test length when compared with unidimensional CAT (Segall, 1996). Several papers on MCAT were reviewed, including Li and Schafer (2005) and Luecht (1996). Segall (1996) proposed an item selection method that maximizes the volume, or the determinant, of the Bayesian information function for the domain scores. Veldkamp and van der Linden (2002) proposed an algorithm that selects the next item by maximizing the posterior expected Kullback–Leibler information (KL). Reckase (2009) summarized these item selection procedures and compared them with a method that selects the item that maximizes the information in the direction with the minimum information. The overall score or composite score over some domain scores or subskills is often reported and can provide an overall achievement or assessment measure for a student. These scores can be used as a qualifier for a program or a predictor of future performance.

Requests for reprints should be sent to Lihua Yao, Defense Manpower Data Center, Monterey Bay, 400 Gigling Rd., Seaside, CA 93955-6771, USA. E-mail: [email protected]
© 2012 The Psychometric Society
Relationships between the domain scores and overall score from the MIRT models have been studied for traditional test data (De la Torre & Hong, 2010; Yao, 2010a). For the overall scores in the MCAT framework, Segall (2001) investigated the MIRT general model in selecting items for the general ability; however, the domain scores for this model cannot be estimated accurately. van der Linden (1999) used methods that minimized the error variance of a linear combination of the domain abilities for the purpose of increasing the precision of the composite score. The overall score was obtained by averaging the domain scores. However, simply averaging the scores from different content areas ignores the fact that (a) different content areas have different maximum raw score points; (b) scores from those content areas are related; and (c) at different score points the relationship between composite scores and content scores may be different. One of the CAT item selection procedures in this study proposes using the MIRT information method for the overall score. This method takes into account all of the above concerns. At different score points, the weight for deriving the overall scores is different, and the overall score has the smallest standard error of measurement among all possible overall scores at a particular score point; such an overall score is found to be a better predictor of an examinee's overall ability (Yao, 2010a). This procedure is described in the section MCAT Item Selection Procedures.

As in Yao (2010a), this study focuses on item selection procedures in the MCAT framework for both the domain scores and the overall scores. It compares the results of five item selection procedures (four existing methods and one newly proposed method) and investigates the effects of the following four factors through simulation: (a) the structure of item pools, (b) the population distribution of simulees, (c) the number of items selected, and (d) the content area. The five procedures described in the MCAT Item Selection Procedures section are also compared with another procedure, G, for the general dimension (Segall, 2001) for the overall score. The procedures are compared by examining the (1) absolute bias (ABSBIAS), (2) correlations, (3) time used in selecting items and estimating abilities, (4) test reliability, and (5) item usage, with the purpose of finding a procedure that yields higher precision for both the domain and composite abilities, and higher item usage. Three sets of item pools are composed with item parameters estimated from real live CAT data. Other factors, such as control of the exposure rate while maintaining the same precision for both the domain and the overall scores, will be considered in a future study. All the item selection procedures in this study have been implemented in the computer language Java. Besides the simulation study, the mathematical theory for some of the procedures is derived.

1. MIRT Models

Following the notation of the MIRT model in Yao and Schwarz (2006), for a dichotomously scored item j, the probability of a correct response to item j for an examinee with ability θ_i = (θ_{i1}, ..., θ_{iD}) under the multidimensional three-parameter logistic (M-3PL; Reckase, 1997) model is

    P_{ij1} = P_{j1}(\theta_i) = P(x_{ij} = 1 \mid \theta_i, \beta_j) = \beta_{3j} + \frac{1 - \beta_{3j}}{1 + e^{-\beta_{2j}\theta_i^T + \beta_{1j}}},    (1)

where x_{ij} = 0 or 1 is the response of examinee i to item j, β_{2j} = (β_{2j1}, ..., β_{2jD}) is a vector of dimension D of item discrimination parameters, β_{1j}/‖β_{2j}‖ is the intercept or the difficulty parameter, β_{3j} is the lower asymptote or the guessing parameter, and β_{2j}θ_i^T = Σ_{l=1}^{D} β_{2jl}θ_{il}. The norm or MDISC (Reckase & McKinley, 1991) is defined as ‖β_{2j}‖ = (Σ_{l=1}^{D} β_{2jl}^2)^{1/2}. The parameters for the jth item are β_j = (β_{2j}, β_{1j}, β_{3j}). Note that only the multidimensional three-parameter logistic model is introduced here, as the items in the simulation study are all multiple-choice items.
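To make the notation concrete, here is a minimal sketch in Java (the language in which the study's procedures were implemented) of the M-3PL response probability in Equation (1) and of the rank-one item information matrix that appears later in Equation (4). The class and method names are illustrative assumptions, not taken from BMIRT or SimuMCAT.

```java
/** Illustrative M-3PL helpers; names are hypothetical, not BMIRT/SimuMCAT code. */
public final class M3PL {

    /** P_{j1}(theta) = beta3 + (1 - beta3) / (1 + exp(-beta2 . theta + beta1)). */
    public static double prob(double[] theta, double[] beta2, double beta1, double beta3) {
        double z = beta1;                          // intercept term of the exponent
        for (int l = 0; l < theta.length; l++) {
            z -= beta2[l] * theta[l];              // subtract beta2 . theta^T
        }
        return beta3 + (1.0 - beta3) / (1.0 + Math.exp(z));
    }

    /** MDISC: the norm of the discrimination vector beta2. */
    public static double mdisc(double[] beta2) {
        double ss = 0.0;
        for (double a : beta2) ss += a * a;
        return Math.sqrt(ss);
    }

    /** D x D item information matrix (see Equation (4) below): a scalar times the
     *  outer product beta2 (x) beta2, whose (m, n) entry is beta2[m] * beta2[n]. */
    public static double[][] info(double[] theta, double[] beta2, double beta1, double beta3) {
        double p = prob(theta, beta2, beta1, beta3);
        double c = (p - beta3) * (p - beta3) * (1.0 - p)
                 / (p * (1.0 - beta3) * (1.0 - beta3));
        int d = beta2.length;
        double[][] inf = new double[d][d];
        for (int m = 0; m < d; m++) {
            for (int n = 0; n < d; n++) {
                inf[m][n] = c * beta2[m] * beta2[n];
            }
        }
        return inf;
    }
}
```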
For an M-3PL item, let

    P_{ij} = P_j(\theta_i) = P_{ij}(X_{ij} \mid \theta_i, \beta_j) = P_{ij1}^{1(X_{ij}=1)} (1 - P_{ij1})^{1(X_{ij}=0)},    (2)

where

    1(X_{ij} = k) = \begin{cases} 1 & \text{if } X_{ij} = k, \\ 0 & \text{otherwise.} \end{cases}
If the subscript i is dropped, the likelihood equation for a response \vec{X} = (X_1, \ldots, X_J) to J items at a given ability θ is

    L(\vec{X} \mid \theta) = P(\vec{X} \mid \theta, \beta) = \prod_{j=1}^{J} P_j(X_j \mid \theta, \beta_j).    (3)
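For dichotomous items, the log of Equation (3) reduces to the familiar Bernoulli form. A short sketch (a hypothetical helper building on the M3PL class above, not the study's estimation code):

```java
/** Illustrative likelihood helper for Equation (3); not BMIRT/SimuMCAT code. */
public final class Likelihood {
    /** log L(X | theta) = sum_j [ x_j log P_j1 + (1 - x_j) log(1 - P_j1) ]. */
    public static double logLikelihood(int[] x, double[] theta,
                                       double[][] beta2, double[] beta1, double[] beta3) {
        double ll = 0.0;
        for (int j = 0; j < x.length; j++) {
            double p = M3PL.prob(theta, beta2[j], beta1[j], beta3[j]);
            ll += (x[j] == 1) ? Math.log(p) : Math.log(1.0 - p);
        }
        return ll;
    }
}
```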
Let P_{j1}(θ), as in Equation (1), indicate the probability at ability θ.

1.1. Item and Test Information Function

The test information function is obtained by following Yao and Schwarz (2006). For an item j, following M-3PL, the information function at θ is

    I_j(\theta) = \frac{(P_{j1} - \beta_{3j})^2 (1 - P_{j1})}{P_{j1} (1 - \beta_{3j})^2} \, \beta_{2j} \otimes \beta_{2j}.    (4)
Here ⊗ is a vector product; β_{2j} ⊗ β_{2j} is a D × D matrix whose mth-row, nth-column element is the product of the mth and nth elements of β_{2j}. The test information for J items at θ is I_J(θ) = Σ_{j=1}^{J} I_j(θ). The directional information in the direction \vec{α} in the multidimensional θ space is \cos(\vec{\alpha}) I_J \cos(\vec{\alpha})^T, where \cos(\vec{\alpha}) = (\cos\alpha_1, \ldots, \cos\alpha_D) and α_i is the angle between the vector θ and the θ_i axis.

1.2. Bayesian Statistics

Suppose that the prior of the population is f(θ). Then the posterior density function of θ is
    f(\theta \mid \vec{X}) = \frac{L(\vec{X} \mid \theta) \, f(\theta)}{f(\vec{X})},    (5)

and

    f(\theta) = (2\pi)^{-D/2} |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2} (\theta - \mu)^T \Sigma^{-1} (\theta - \mu) \right\},    (6)

where μ and Σ represent the population mean and variance-covariance matrix, respectively.

First Derivative

    \frac{\partial \log f(\theta \mid \vec{X})}{\partial \theta} = \frac{\partial \log L(\vec{X} \mid \theta)}{\partial \theta} - \frac{\partial (\theta - \mu)}{\partial \theta} \Sigma^{-1} (\theta - \mu),    (7)

where

    \frac{\partial (\theta - \mu)}{\partial \theta_k} = (0, \ldots, 1, 0, \ldots, 0)_{1 \times D},

and 1 is in the kth position.
Second Derivative

    \frac{\partial^2 \log f(\theta \mid \vec{X})}{\partial \theta^2} = \frac{\partial^2 \log L(\vec{X} \mid \theta)}{\partial \theta^2} - \Sigma^{-1} = J(\theta) - \Sigma^{-1}.    (8)
Item and Test Information Function

The posterior test information at θ for the selected j − 1 items is

    I_{j-1}(\theta) = -E\,J(\theta) + \Sigma^{-1} = \sum_{m=1}^{j-1} \frac{(P_{m1} - \beta_{3m})^2 (1 - P_{m1})}{P_{m1} (1 - \beta_{3m})^2} \, \beta_{2m} \otimes \beta_{2m} + \Sigma^{-1}.    (9)
Here the bold variable I_{j-1} indicates the sum of I_1, ..., I_{j-1}.

2. MCAT Item Selection Procedures

MIRT item parameters can be of complex structure (i.e., items load on more than one dimension) or of simple structure (i.e., items load on only one dimension). There are rationales for both types of items. For many testing companies, items are designed to measure certain objectives so that objective-level scores can be reported in addition to an overall score. MIRT simple structured item parameters for reporting domain scores and overall scores are therefore of importance. Research has shown the advantages of simple structured items over complex structured items (Luecht & Miller, 1992; Yao & Boughton, 2009). Mulder and van der Linden (2009) observed that items with large discrimination parameters for more than one ability are generally not informative. Lee, Ip, and Fuh (2008) also showed that for a two-dimensional test, information increases as the index based on the absolute difference of the two discriminations increases; results for this index were at least comparable or even superior to those using MDISC. These two papers support the use of items of essentially unidimensional models, which are similar to simple structured models. For the current study, the item pool consists of items of simple structure. The theorem and corollaries below provide the foundation for understanding some of the CAT item selection procedures for item pools of simple structured items.

Five item selection procedures are described in this section. For most of the procedures, one needs to find the maximum or minimum values of certain multivariable functions. The extreme value theorem in calculus states that if a real-valued function f is continuous on the closed and bounded interval [a, b], then f must attain its maximum and minimum value, each at least once; that is, there exist numbers c and d in [a, b] such that f(c) ≥ f(x) ≥ f(d) for all x ∈ [a, b]. To find the maximum and minimum values of a function, there are two steps:

1. Find all the critical values x_c such that
    \left. \frac{\partial f}{\partial x} \right|_{x_c} = 0.    (10)
2. Compare f(x_c), f(a), and f(b). The maximum or minimum solution is the x among x_c, a, and b at which f(x) has the maximum or minimum value.

For a multivariable function, the method still holds. Four of the five procedures have been studied in previous research. The newly proposed procedure, named V2, and the optimized weight for the overall score are described first in detail.

2.1. Select an Item That Has the Minimum Error Variance for the Composite Score of Optimized Weight, V2

In Yao (2010a), a procedure for deriving the overall score by the optimized weight was described for paper-and-pencil data. The weight is different at different score points. For a test
with J items of known item parameters and for a given score point θ, the test information is I_J(θ). The composite score θ_α = Σ_{l=1}^{D} θ_l w_l has a standard error of measurement SEM(θ_α) = V(θ_α)^{1/2}, where V(θ_α) = \vec{w} V(θ) \vec{w}^T and \vec{w} = (w_1, \ldots, w_D) = (\cos^2\alpha_1, \ldots, \cos^2\alpha_D). V(θ) can be approximated by I(θ)^{-1}. The weight \vec{w}, called the optimized weight, such that SEM(θ_α) has a minimum value does exist, as proved by the following theorem and Corollary 1.

Theorem 1. Let B_{D×D} = (a_{ij}) be a matrix. The critical vector \vec{w} = (w_1, \ldots, w_D) such that \vec{w} B \vec{w}^T has an extreme value under the assumption Σ_{l=1}^{D} w_l = A is

    \vec{w} = \frac{A}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}} \left( \sum_{j=1}^{D} b_{1j}, \ldots, \sum_{j=1}^{D} b_{Dj} \right) = \frac{A}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}} \, \vec{y} B^{-1},    (11)

where A is a constant, B^{-1} = (b_{ij})_{D×D}, and \vec{y}_{1×D} = (1, \ldots, 1).

Proof: Let
    f(w_1, \ldots, w_D, \lambda) = \vec{w} B \vec{w}^T - \lambda \left( \sum_{l=1}^{D} w_l - A \right).    (12)
To find the solution \vec{w} such that the function f has an extreme value, the derivatives are derived and set to zero:

    \frac{\partial f}{\partial w_l} = \vec{x} B \vec{w}^T + \vec{w} B \vec{x}^T - \lambda = 2 \sum_{j=1}^{D} a_{lj} w_j - \lambda = 0    (13)
for l = 1, \ldots, D, where \vec{x} = (0, \ldots, 0, 1, 0, \ldots, 0) with the lth element of value 1. Moreover,

    \frac{\partial f}{\partial \lambda} = \sum_{l=1}^{D} w_l - A = 0.    (14)
Since the matrix B is symmetric, we have

    2 B \vec{w}^T = \lambda \vec{y}^T,    (15)
where \vec{y}_{1×D} = (1, \ldots, 1). Multiplying both sides of Equation (15) by B^{-1}, we obtain

    \vec{w} = \frac{\lambda}{2} \left( \sum_{j=1}^{D} b_{1j}, \ldots, \sum_{j=1}^{D} b_{Dj} \right).    (16)

Since Σ_{l=1}^{D} w_l = A, we obtain

    \lambda = \frac{2A}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}}.    (17)

Therefore,

    \vec{w} = \frac{A}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}} \left( \sum_{j=1}^{D} b_{1j}, \ldots, \sum_{j=1}^{D} b_{Dj} \right) = \frac{A}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}} \, \vec{y} B^{-1}.    (18)
Corollary 1. Let the variance matrix at a given θ be V_{D×D} = I^{-1}_{D×D}, where I_{D×D} = (b_{ij}) is the information function at θ for a set of item parameters of simple structure. The weight \vec{w} = (w_1, \ldots, w_D) such that f(\vec{w}) = \vec{w} V \vec{w}^T has a minimum value is

    \vec{w} = \frac{1}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}} \left( \sum_{j=1}^{D} b_{1j}, \ldots, \sum_{j=1}^{D} b_{Dj} \right).    (19)

The domain of the function f(\vec{w}) is the bounded region {(w_1, \ldots, w_D): w_j ≥ 0, j = 1, \ldots, D, Σ_{j=1}^{D} w_j = 1}.

Proof: Let B = V = I^{-1} = (a_{ij}) in Theorem 1. Then B^{-1} = I = (b_{ij}). All the boundary (weight) vectors are of the form \vec{x}_i = (0, \ldots, A, 0, \ldots, 0), where the ith element is A = 1 and i = 1, \ldots, D. It can be seen that f(\vec{x}_i) = A^2 a_{ii}. For the critical vector \vec{w} obtained in Theorem 1, we have

    f(\vec{w}) = \left( \frac{A}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}} \right)^2 \vec{y} B^{-1} \vec{y}^T = \frac{A^2}{\sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj}}.    (20)

For simple structured items, the information function I has zeros everywhere except the diagonal. That is,

    I = \begin{pmatrix} b_1^2 & 0 & \cdots & 0 \\ 0 & b_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & b_D^2 \end{pmatrix}_{D \times D}.    (21)

Therefore,

    a_{ii} = \frac{1}{b_i^2},    (22)

a_{ij} = 0 for i ≠ j, and

    \sum_{k=1}^{D} \sum_{j=1}^{D} b_{kj} = \sum_{k=1}^{D} \frac{1}{a_{kk}}.    (23)

It is clear that

    f(\vec{w}) = \frac{A^2}{\sum_{k=1}^{D} 1/a_{kk}} < f(\vec{x}_i) = A^2 a_{ii}    (24)

for all i = 1, \ldots, D.
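As a numeric illustration of Corollary 1 (a sketch under the stated simple-structure assumption; the numbers are invented for illustration): with D = 2 and I = diag(4, 1), the optimized weight is \vec{w} = (0.8, 0.2), giving composite error variance \vec{w} V \vec{w}^T = 0.8^2/4 + 0.2^2/1 = 0.20, smaller than the 0.3125 produced by the equal weight (0.5, 0.5).

```java
/** Sketch of Corollary 1 for a diagonal (simple-structure) information matrix;
 *  names are illustrative, not the study's code. */
public final class OptimizedWeight {

    /** Optimized weight: proportional to the per-dimension information b_k^2. */
    public static double[] weight(double[] infoDiag) {
        double total = 0.0;
        for (double b : infoDiag) total += b;
        double[] w = new double[infoDiag.length];
        for (int k = 0; k < w.length; k++) w[k] = infoDiag[k] / total; // sums to A = 1
        return w;
    }

    /** Composite error variance w V w^T with V = I^{-1} = diag(1 / b_k^2). */
    public static double compositeVariance(double[] w, double[] infoDiag) {
        double v = 0.0;
        for (int k = 0; k < w.length; k++) v += w[k] * w[k] / infoDiag[k];
        return v;
    }

    public static void main(String[] args) {
        double[] info = {4.0, 1.0};                    // invented two-dimensional example
        double[] w = weight(info);                     // -> {0.8, 0.2}
        System.out.println(compositeVariance(w, info));                      // 0.20
        System.out.println(compositeVariance(new double[]{0.5, 0.5}, info)); // 0.3125
    }
}
```

In words, the optimized weight loads the composite on the dimensions that are measured most precisely, which is why, as noted later, V2 without content restriction tends to select items in the heavily weighted dimensions.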
The optimized weight for the overall score has several advantages over a simple linear combination of domain scores, as discussed in the Introduction. Moreover, such a weighted overall score has the smallest SEM among all the weighted scores. Note that the formula for the optimized weight is derived in Corollary 1 for simple structured items. However, the optimized weight also exists for items of complex structure, and it can be obtained numerically through a computer program. For a given set of item parameters and known domain scores, the optimized weight can be derived. However, there are situations in which the domain scores are unknown; for example, in MCAT the examinees' domain abilities are unknown. How, then, do we choose the optimized weight? Two approaches for V2 are described: (a) using a known, prefixed optimized weight; (b) starting with no known weight and updating the weight.
For all the procedures, the initial abilities are set to θ_l = 0 for l = 1, \ldots, D. For j = 2, \ldots, J, suppose that j − 1 items have been selected; to select the next, jth, item, suppose that the updated ability is \hat{\theta}^{j-1}. For each of the procedures, the steps proposed are repeated. Please note that the procedures can be both Bayesian and non-Bayesian; for Bayesian, add Σ^{-1} to the information.

For approach (a), suppose that θ = Σ_{k=1}^{D} w_k θ_k; then Var(θ) = \vec{w} Var(\vec{\theta}) \vec{w}^T. The weight \vec{w}^{j-1} = \vec{w} used in selecting the j items is the prefixed optimized weight derived from the true domain ability and a set of item parameters.

1. For each item m in the pool, compute

    I_j^m(\hat{\theta}^{j-1}) = I_{j-1}(\hat{\theta}^{j-1}) + \frac{(P_{m1} - \beta_{3m})^2 (1 - P_{m1})}{P_{m1} (1 - \beta_{3m})^2} \, \beta_{2m} \otimes \beta_{2m}    (25)

at ability \hat{\theta}^{j-1}.
2. Select item j = m such that \vec{w}^{j-1} [I_j^m(\hat{\theta}^{j-1})]^{-1} (\vec{w}^{j-1})^T has a minimum value.
3. Update the ability \hat{\theta}^{j} and the information I_j(\hat{\theta}^{j}) based on the selected j items.

However, in real practice the true domain scores are unknown. In this study, the following steps are used for approach (b) in selecting items for V2. Let M < J be a chosen integer.

1. For j < M, select items using the steps shown above, with a prefixed weight of equal values, i.e., w_l = 1/D for l = 1, \ldots, D.
2. For j ≥ M, compute the optimized weight \vec{w}^{j-1} based on the j − 1 selected items.
3. Apply step 2 from approach (a) above.
4. Apply step 3 from approach (a) above.

The integer M can be chosen by the user, for example, M = 0 or M = J/3, where J is the total number of selected items. These two numbers are applied in this study.

2.2. Select an Item That Has the Minimum Error Variance for the Composite Score of Equal Weight, V1

This method was studied in van der Linden (1999) for increasing the precision of overall scores. It is similar to V2, with the weight \vec{w} prefixed with equal values, i.e., w_l = 1/D for l = 1, \ldots, D.

To better understand the next procedure, Ag, the following corollary and its proof are provided.

Corollary 2. Let I_{D×D} = (b_{ij}) be the information function at θ for a set of item parameters of simple structure. The direction \vec{\alpha} = (\alpha_1, \ldots, \alpha_D) such that f(\vec{\alpha}) = \cos\vec{\alpha} \, I \cos\vec{\alpha}^T has a minimum value is \vec{\alpha} = (\pi/2, \ldots, \pi/2, 0, \pi/2, \ldots, \pi/2), where the ith element is 0 and b_{ii} = \min\{b_{kk}, k = 1, \ldots, D\}. Here \cos\vec{\alpha} = (\cos\alpha_1, \ldots, \cos\alpha_D). The domain of the function f(\vec{\alpha}) is the bounded region E = \{(\alpha_1, \ldots, \alpha_D): 0 \leq \alpha_k \leq \pi/2, \sum_{k=1}^{D} \cos^2\alpha_k = 1\}.

Proof: Since b_{ij} = 0 for i ≠ j, we have f(\vec{\alpha}) = \sum_{l=1}^{D} b_l^2 \cos^2\alpha_l, where b_{ll} = b_l^2 for l = 1, \ldots, D. Let

    f(\vec{\alpha}, \lambda) = \sum_{l=1}^{D} b_l^2 \cos^2\alpha_l - \lambda \left( \sum_{l=1}^{D} \cos^2\alpha_l - 1 \right).    (26)
To find the solution \vec{\alpha} such that the function f(\vec{\alpha}, \lambda) has an extreme value, the derivatives are derived and set to zero:

    \frac{\partial f}{\partial \alpha_l} = -2 b_l^2 \cos\alpha_l \sin\alpha_l + 2\lambda \cos\alpha_l \sin\alpha_l = 0.    (27)
Therefore (\lambda - b_l^2) \cos\alpha_l \sin\alpha_l = 0 for l = 1, \ldots, D, and

    \frac{\partial f}{\partial \lambda} = \sum_{l=1}^{D} \cos^2\alpha_l - 1 = 0.    (28)
Suppose that λ ≠ b_l^2 for all l = 1, \ldots, D; then \alpha_l = 0 or \pi/2 for l = 1, \ldots, D. Since \sum_{l=1}^{D} \cos^2\alpha_l = 1, there is only one i such that \alpha_i = 0, and \alpha_j = \pi/2 for j ≠ i. Therefore, \vec{\alpha}_l = (\pi/2, \ldots, 0, \pi/2, \ldots, \pi/2), where the lth element is 0, is a critical point, for l = 1, \ldots, D. Moreover, f(\vec{\alpha}_l) = b_l^2.

Suppose that λ = b_k^2 for some k ∈ {1, \ldots, D} and there is at least one i such that b_i^2 ≠ b_k^2; then \alpha_i = 0 or \pi/2. If \alpha_i = 0, then the critical point is \vec{\alpha}_i = (\pi/2, \ldots, 0, \pi/2, \ldots, \pi/2), where the ith element is 0. If for all i such that b_i^2 ≠ b_k^2 we have \alpha_i = \pi/2, then \sum_{\{l: b_l = b_k\}} \cos^2\alpha_l = 1 and f(\vec{\alpha}) = \sum_{\{l: b_l = b_k\}} b_l^2 \cos^2\alpha_l = b_k^2.

Let i be a number such that b_{ii} = \min\{b_{kk}, k = 1, \ldots, D\}, and let \vec{\alpha}_i = (\pi/2, \ldots, \pi/2, 0, \pi/2, \ldots, \pi/2) be a boundary value in the domain. Then it is not difficult to see that f(\vec{\alpha}_i) = \min\{f(\vec{\alpha}) \mid \vec{\alpha} \in E\}, as \cos 0 = 1.

2.3. Select an Item That Has the Maximum Information in the Direction That Has the Minimum Information for Previously Selected Items, Angle or Ag

1. At the ability level \hat{\theta}^{j-1}, let the direction \vec{\alpha} = (\alpha_1, \ldots, \alpha_D) be the minimizer such that \cos(\vec{\alpha}) I_{j-1}(\hat{\theta}^{j-1}) \cos(\vec{\alpha})^T has a minimum value over all possible angles. Here \cos(\vec{\alpha}) = (\cos\alpha_1, \ldots, \cos\alpha_D).
2. For each item m in the pool, compute

    I_j^m(\hat{\theta}^{j-1}) = I_{j-1}(\hat{\theta}^{j-1}) + \frac{(P_{m1} - \beta_{3m})^2 (1 - P_{m1})}{P_{m1} (1 - \beta_{3m})^2} \, \beta_{2m} \otimes \beta_{2m}

at ability \hat{\theta}^{j-1}.
3. Select item j = m such that \cos(\vec{\alpha}) I_j^m(\hat{\theta}^{j-1}) \cos(\vec{\alpha})^T has a maximum value among all the items in the pool.
4. Update the ability \hat{\theta}^{j} and the information I_j(\hat{\theta}^{j}) based on the selected j items.

From Corollary 2 one can see that the minimizer for the directional information does exist and can be derived mathematically. Moreover, Ag selects an item in the direction that has less information (i.e., the direction aligned with the dimension that has less information).

2.4. Select an Item That Has the Maximum Volume or the Maximum Determinant of the Information, Volume or Vm

Segall (1996) proposed selecting the next item j by maximizing the determinant of the posterior information:

    W = \left| I_{j-1}(\hat{\theta}^{j-1}) + I_j(\hat{\theta}^{j-1}) + \Sigma^{-1} \right|,    (29)

where I_{j-1}(\hat{\theta}^{j-1}) is the information obtained from the already selected j − 1 items at the ability estimates \hat{\theta}^{j-1}.

1. For each item m in the pool, compute the volume or the determinant of the information,

    W_m = \left| I_{j-1}(\hat{\theta}^{j-1}) + \frac{(P_{m1} - \beta_{3m})^2 (1 - P_{m1})}{P_{m1} (1 - \beta_{3m})^2} \, \beta_{2m} \otimes \beta_{2m} + \Sigma^{-1} \right|

at ability \hat{\theta}^{j-1}.
2. Select item j = m such that W_m has the maximum value.
3. Update the ability \hat{\theta}^{j} and the information I_j(\hat{\theta}^{j}) based on the selected j items.

For the non-Bayesian procedure, the above equations still hold with the removal of Σ^{-1}. However, the first D items must then be selected from the D domains, especially for items of simple structure, as the matrix needs to be nonsingular.

To better understand Vm and Ag, the following corollary is derived.

Corollary 3. Let I_{D×D} = (b_{ij}) be the information function at θ for a set of item parameters of simple structure. Suppose that the item pool is ideal (Reckase, 2009): it contains items of discrimination 1 and of any difficulty level of your choice. Then Ag and Vm select the same set of items, possibly in a different order.

Proof: Let

    I = \begin{pmatrix} b_1 & 0 & \cdots & 0 \\ 0 & b_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & b_D \end{pmatrix}_{D \times D}.    (30)
Without loss of generality, suppose that b_1 = \min\{b_l, l = 1, \ldots, D\}. The determinant after adding an item of information k from the first domain is W = (b_1 + k) \prod_{l=2}^{D} b_l, which is larger than or equal to W = (b_j + k) \prod_{l \neq j} b_l for any j ≠ 1. Since it is an ideal pool, we can assume that such a k is available for any domain. Therefore, the next selected item has to come from one of the domains that has less information or fewer selected items.

From Corollary 3, for an ideal item pool, Vm selects items alternately from all the domains. From Corollary 2, for an ideal item pool, Ag selects the next item in the direction that has less information or fewer selected items. Therefore, Ag and Vm select the same set of items, but perhaps in a different order. In general, both Ag and Vm select the next item in the direction with less information. However, an actual item pool contains items of different statistics; therefore, Ag and Vm may in general select different sets of items. Using the Vm method, Mulder and van der Linden (2009) stated that Vm selects the next item with a large discrimination in the dimension that has small information. Large discrimination results in large information. Therefore, their statement is consistent with the theoretical results in this study and with the Ag method.

2.5. Select an Item That Has the Maximum Posterior KL Information, KL

For an M-3PL item m, the Kullback–Leibler information is the distance between the likelihoods at two ability points, \hat{\theta}^{j-1} = (\hat{\theta}_1^{j-1}, \ldots, \hat{\theta}_D^{j-1}) and \theta^0, and is defined as

    K_m(\hat{\theta}^{j-1}, \theta^0) = E_{\theta^0} \log \frac{P_m(X_m \mid \theta^0, \beta_m)}{P_m(X_m \mid \hat{\theta}^{j-1}, \beta_m)} = P_{m1}(\theta^0) \log \frac{P_{m1}(\theta^0)}{P_{m1}(\hat{\theta}^{j-1})} + \left(1 - P_{m1}(\theta^0)\right) \log \frac{1 - P_{m1}(\theta^0)}{1 - P_{m1}(\hat{\theta}^{j-1})},    (31)

where \theta^0 is the true ability and \hat{\theta}^{j-1} is the current ability estimate based on the selected j − 1 items. The Kullback–Leibler information tells us how well the response variable discriminates between the ability estimates and the true ability value. For the j − 1 selected items, define K_{j-1}(\hat{\theta}^{j-1}, \theta^0) = \sum_{l=1}^{j-1} K_l(\hat{\theta}^{j-1}, \theta^0). The Bayesian KL for item m (Chang & Ying, 1996; Veldkamp & van der Linden, 2002) is
    K_m(\hat{\theta}^{j-1} \mid \vec{X}) = \int_{\theta} \left[ K_{j-1}(\hat{\theta}^{j-1}, \theta) + K_m(\hat{\theta}^{j-1}, \theta) \right] f(\theta \mid \vec{X}) \, d\theta
    = \int_{\hat{\theta}_1^{j-1} - \delta_j}^{\hat{\theta}_1^{j-1} + \delta_j} \cdots \int_{\hat{\theta}_D^{j-1} - \delta_j}^{\hat{\theta}_D^{j-1} + \delta_j} \left[ K_{j-1}(\hat{\theta}^{j-1}, \theta) + K_m(\hat{\theta}^{j-1}, \theta) \right] f(\theta \mid \vec{X}) \, d\theta_1 \cdots d\theta_D,    (32)

where \delta_j = \frac{3}{\sqrt{j}}.
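The integrand of Equation (32) is built from the pointwise KL distance of Equation (31). A minimal sketch follows (a hypothetical helper; in practice the box integral of Equation (32) would be approximated numerically, e.g., by quadrature over a grid of θ values, which is an implementation assumption, not a detail given in the paper):

```java
/** Pointwise KL distance of Equation (31) for a dichotomous M-3PL item. */
public final class KLInfo {
    /** p0: P_m1 at the reference ability; pHat: P_m1 at the current estimate. */
    public static double klItem(double p0, double pHat) {
        return p0 * Math.log(p0 / pHat)
             + (1.0 - p0) * Math.log((1.0 - p0) / (1.0 - pHat));
    }
}
```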
1. For each item m in the pool, compute the posterior KL information K_m(\hat{\theta}^{j-1} \mid \vec{X}) using Equation (32) with \delta_j = 3/\sqrt{j}. Here \vec{X} is the response vector for the selected j − 1 items.
2. Select item j = m such that K_m(\hat{\theta}^{j-1} \mid \vec{X}) has the maximum value.
3. Update the ability \hat{\theta}^{j} based on the selected j items.

2.6. Select an Item That Has Maximum Information for the Composite Score in the General Dimension, G

This general dimension method, G, for the overall score was first proposed in Segall (2001). An item is selected if it has the maximum information in the general dimension at the current ability estimates \hat{\theta}^{j-1}. The general dimension is the first dimension, on which all items from all the domains load.

The theorem and corollaries show that Ag and Vm have similar goals when selecting items: they balance the information for each of the domains. However, V1 and V2 have another goal: they favor the overall score (i.e., maximum information for the overall score), which is defined by the weighted sum of the domain scores; the weight for V2 is different at different score points. KL is completely different. Wang, Chang, and Boughton (2011) concluded that KL favors MDISC, which is the item discrimination for items of simple structure. Therefore, KL selects items with a different purpose than the other four procedures do; it does not favor the overall score, and it does not consider content information balancing. KL selects items with a likelihood that is maximally different from the likelihood at any other ability level (Veldkamp & van der Linden, 2002).
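To summarize the mechanics of the newly proposed procedure, here is a hedged sketch of one V2 selection step for the non-Bayesian, simple-structure case, where every information matrix is diagonal and therefore inverts elementwise; all names are illustrative rather than the SimuMCAT implementation.

```java
/** Sketch of one V2 step (non-Bayesian, simple structure): pick the candidate
 *  that minimizes the composite error variance w [I_{j-1} + I_m]^{-1} w^T. */
public final class V2Step {
    public static int select(double[] infoDiag, double[][] candInfoDiag, double[] w) {
        int best = -1;
        double bestVar = Double.POSITIVE_INFINITY;
        for (int m = 0; m < candInfoDiag.length; m++) {
            double var = 0.0;
            for (int k = 0; k < w.length; k++) {
                double b = infoDiag[k] + candInfoDiag[m][k]; // updated diagonal info
                var += w[k] * w[k] / b;                      // w diag(1/b) w^T
            }
            if (var < bestVar) { bestVar = var; best = m; }
        }
        return best;                                         // next item to administer
    }
}
```

The Bayesian variant would add Σ^{-1} to the updated information, which in general destroys diagonality and requires a full matrix inverse.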
3. Applications

The five CAT item selection procedures Ag, Vm, V1, V2, and KL are compared for both domain scores and overall scores through simulated data; they are also compared with the G method for the overall scores. The item pool and simulation conditions are discussed in this section. Please note that while the data used in this study follow the M-3PL model, the item selection methods and software can also handle polytomously scored items following the generalized two-parameter partial credit model.

3.1. Item Pool

The item parameters in the item pool came from the item parameter estimates of live data of approximately 176,000 examinees taking the CAT Armed Services Vocational Aptitude Battery (ASVAB) to qualify for service in the U.S. Military. Each examinee took four tests totaling 55 items: 15 Arithmetic Reasoning (AR), 15 Word Knowledge (WK), 10 Paragraph Comprehension (PC), and 15 Math Knowledge (MK). The 176,000 examinees, as a group, generated approximately 49,000 responses per item, and the union of their tests consisted of 257 AR items, 265 WK items, 144 PC items, and 246 MK items, a total of 912 items. The MIRT confirmatory analysis was conducted in the following three ways to obtain the item parameters for the three sets of item pools.
TABLE 1.
Dimensional-loading information for the seven-dimensional general model.

Item   AFQT   MA   VB   WK   PC   MK   AR
AR      x     x                        x
WK      x          x    x
PC      x          x         x
MK      x     x                   x
• Pool 1 (P1): MIRT four-dimensional calibration with simple structure, with dimensions 1 to 4 representing content AR, WK, PC, and MK, respectively. The scale was fixed by imposing the prior of the ability distribution to be a standard multivariate normal with a mean of (0, 0, 0, 0), variances of 1, and correlations of 0. The posterior population distribution was estimated by BMIRT (Yao, 2003), and the estimated variance-covariance matrix is

    A = \begin{pmatrix} 1 & 0.5 & 0.5 & 0.7 \\ 0.5 & 1 & 0.6 & 0.4 \\ 0.5 & 0.6 & 1 & 0.4 \\ 0.7 & 0.4 & 0.4 & 1 \end{pmatrix}_{4 \times 4}.    (33)

• Pool 2 (P2): MIRT four-dimensional calibration with simple structure. The scale was fixed by imposing the prior of the ability distribution to be multivariate normal with a mean of (0, 0, 0, 0) and a variance-covariance matrix of A as shown in Equation (33), which is similar to those obtained from the total raw scores of the responses.

• Pool 3 (P3): MIRT seven-dimensional calibration, with dimensional loadings as shown in Table 1. The seven dimensions measure three composite scores, comprising the Armed Forces Qualification Test (AFQT), Math (MA), and Verbal (VB), and the four domain scores, comprising AR, WK, PC, and MK. The scale was fixed with the prior of the population distribution set to a seven-dimensional standard multivariate normal, with means 0, variances 1, and covariances 0. The first dimension, measuring AFQT, is called the general dimension; all the items load on it.

P1 and P2 are for the four content domain scores and their composite score AFQT. P1 and P2 differ in that they have different coordinates for obtaining the item parameters; good item parameter estimates in the pool will better serve the testing purpose. P3 is mainly for the general dimension AFQT and the two composites MA and VB. The four domains will not be examined for P3 because they cannot be estimated accurately.

The item summary statistics for the three pools and the four contents are displayed in Table 2. All three pools show that the item difficulty and guessing have similar statistics. Compared to the other three domains, the PC items have smaller discriminations and smaller difficulties. For P3, the item discrimination for dimension AFQT has statistics similar to P1 and P2; however, the discriminations for the other six dimensions are much smaller than those for P1 and P2.

3.2. Simulation Conditions

Four sets of 1000 examinees each are simulated from multivariate normal distributions with the means and variance-covariance matrices indicated as follows: B1 = (0, 0, 0, 0), B2 = (0, 0.5, −0.5, 1.0), B3 = (−1, 0.0, 1, −0.7); A1 is an identity matrix, and A2 is the same as Equation (33). The four sets of populations, named C1, C2, C3, and C4, are listed in Table 3. For each of the true abilities sampled, 10 different seeds are used in generating the responses (i.e., 10 replications).
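The correlated simulees can be generated, for instance, with the standard Cholesky technique: factor the variance-covariance matrix as Σ = L L^T and set θ = μ + L z for standard normal z. A sketch with illustrative names (not the study's actual generator):

```java
import java.util.Random;

/** Draw theta ~ N(mu, Sigma) given the lower Cholesky factor L of Sigma. */
public final class AbilitySampler {
    public static double[] sample(double[] mu, double[][] cholL, Random rng) {
        int d = mu.length;
        double[] z = new double[d];
        for (int k = 0; k < d; k++) z[k] = rng.nextGaussian();            // z ~ N(0, I)
        double[] theta = new double[d];
        for (int i = 0; i < d; i++) {
            theta[i] = mu[i];
            for (int k = 0; k <= i; k++) theta[i] += cholL[i][k] * z[k];  // mu + L z
        }
        return theta;
    }
}
```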
TABLE 2.
Item statistics for the three pools.

            Discrimination                                   Difficulty                        Guessing
Type   AFQT   MA    VB    AR    WK    PC    MK      AR      WK      PC      MK         AR     WK     PC     MK
P1
MIN     –     –     –    0.12  0.34  0.27  0.10   −7.60   −8.45   −6.04   −7.49      0.05   0.08   0.10   0.07
MAX     –     –     –    5.78  4.14  3.07  5.99    5.97    6.37    3.49    7.60      0.34   0.32   0.22   0.37
MEAN    –     –     –    2.05  2.13  1.70  2.32   −0.79   −0.34   −1.30   −0.15      0.19   0.19   0.19   0.19
STD     –     –     –    0.73  0.71  0.55  0.90    2.42    2.96    2.16    2.61      0.03   0.02   0.01   0.03
P2
MIN     –     –     –    0.63  0.27  0.83  0.16   −7.35   −8.39   −7.23   −7.92      0.05   0.08   0.08   0.062
MAX     –     –     –    5.84  3.96  4.78  5.99    5.55    6.37    4.58    8.99      0.35   0.29   0.22   0.392
MEAN    –     –     –    2.00  2.02  1.89  2.24   −0.76   −0.35   −1.40   −0.17      0.19   0.19   0.18   0.190
STD     –     –     –    0.59  0.59  0.54  0.86    2.55    3.07    2.58    2.72      0.03   0.02   0.02   0.038
P3
MIN    0.21  0.00  0.00  0.03  0.03  0.09  0.03   −7.07   −8.18   −6.18   −8.111     0.05   0.08   0.08   0.032
MAX    5.92  3.82  3.10  4.30  2.72  1.64  5.99    5.37    6.83    2.87    7.751     0.26   0.30   0.24   0.294
MEAN   1.72  0.49  0.57  0.62  0.60  0.55  0.91   −0.78   −0.38   −1.30   −0.104     0.18   0.19   0.18   0.183
STD    0.63  0.66  0.75  0.50  0.47  0.34  0.95    2.40    3.07    2.12    2.613     0.02   0.02   0.02   0.036
TABLE 3.
Population distributions for the simulees.

Name   Population mean   Population variance-covariance
C1           B1                      A1
C2           B1                      A2
C3           B2                      A1
C4           B3                      A2
TABLE 4.
Varying conditions for the simulations.

Content   Method   Test length   Population   Item pool
0         Ag       18            C1           P1
1         Vm       36            C2           P2
2         V1       55            C3           P3
          V2                     C4
          KL
          G
Table 4 lists all the conditions considered. The first column, Content, indicates the treatment for the content/objective/domain. Content = 0 indicates that no content information is considered; items are selected from the item pools based only on the procedures. Content = 1 indicates that each content area has to have the required number of items, and the item order does not matter. Content = 2 indicates that the items are selected alternately from the content areas and each content area has to have the required number of items. For example, an examinee may have items in the order of WK, AR, PC, and MK. If the number of items reaches the maximum number in a particular content area, then no more items are selected from that content area. Three test lengths of size 18, 36, and 55 are considered.
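For concreteness, the Content = 2 treatment can be realized as a simple rotation with per-content quotas; the following is a sketch under illustrative names, where the required counts for a 36-item test would be 10, 10, 6, and 10 following the percentages given below.

```java
/** Sketch of the Content = 2 rotation: cycle over content areas, skipping
 *  any area whose required quota of items has already been given. */
public final class ContentRotation {
    public static int nextContent(int[] given, int[] required, int lastContent) {
        int d = required.length;
        for (int step = 1; step <= d; step++) {
            int c = (lastContent + step) % d;       // next area in the rotation
            if (given[c] < required[c]) return c;   // this area still needs items
        }
        return -1;                                  // all quotas filled: stop the test
    }
}
```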
For the content balancing procedures, the percentages of items selected for AR, WK, PC, and MK are 28, 28, 16, and 28, respectively.

3.3. MIRT Domain Ability and Overall Ability Estimates

For the four domain scores (for P1 and P2), the updated scores after each item selection and the final scores are obtained through MIRT Bayesian maximum a posteriori estimation (BMIRT). Yao (2010b) found that this is the best method for estimating the MIRT domain abilities; noninformative priors are used to reduce the bias. For V1, the "true" and the final estimated overall scores are a simple average of the four domain scores. For Ag, Vm, V2, and KL, the "true" and the final overall scores are obtained from the weighted sum of the four domain scores with the optimized weight at their domain score points. Corresponding to the two approaches in selecting items for V2, there are two approaches for computing the "true" and the estimated overall score: (a) the prefixed optimized weights are obtained based on a set of item parameters; (b) the optimized weights are obtained based on the selected item parameters. For the simulated data, the true domain scores are known; therefore approach V2(a) is applied, with the item parameters coming from the paper-and-pencil ASVAB data for the four contents with the same testing purpose. There are 30, 35, 15, and 25 items for AR, WK, PC, and MK, respectively, for this ASVAB data. Even though the V2(a) approach is not possible in real practice, it was still applied for the purpose of comparing it to the other procedures. Approach V2(b) is applied for Content = 1 and 2; it is not useful for Content = 0, as the selected items would be mostly from one domain. For V2(b), the "true" and estimated overall scores are based on the optimized weight derived from the selected items, and its results are compared with the results from the other procedures with their "true" and estimated overall scores computed in the same manner.

3.4. Test Reliability

Following Segall (2001), for a test with given item parameters, the test reliability is defined as follows. For an ability level θ_i that takes a value from Q = 31 equally spaced ability levels between −3 and 3, i.e., θ_i ∈ {−3, −2.8, \ldots, 3}, simulate N = 500 sets of responses for the test based on θ_i, the item parameters, and the IRT model. For a multidimensional model with an ability level θ_i of dimension D, θ_i is one of the points in a set that has Q^D elements in the space [−3, 3] × \cdots × [−3, 3], with Q = 7, and each coordinate of θ_i comes from Q equally spaced ability levels between −3 and 3. For each θ_i, compute the mean and variance of the raw scores for the N simulees. Let x_n(θ_i), n = 1, \ldots, N, indicate the number-correct score for the nth simulee. Then the mean and variance of the scores are

    m(\theta_i) = m(x \mid \theta_i) = \frac{\sum_{n=1}^{N} x_n(\theta_i)}{N},    (34)

    \mathrm{Var}(\theta_i) = \mathrm{var}(x \mid \theta_i) = \frac{\sum_{n=1}^{N} [x_n(\theta_i) - m(x \mid \theta_i)]^2}{N - 1}.    (35)

Finally, the overall mean and variance can be obtained by

    E = \sum_{i=1}^{Q^D} v_i \, m(\theta_i),    (36)

    \mathrm{Var} = \sum_{i=1}^{Q^D} v_i \left[ m(\theta_i) - E \right]^2,    (37)
where v_i is proportional to the normal density at θ_i, and \sum_{i=1}^{Q^D} v_i = 1. Let the mean variance, or error variance, be computed by

    E(\mathrm{Var}) = \sum_{i=1}^{Q^D} v_i \, \mathrm{Var}(\theta_i).    (38)
In general, the test reliability is defined by

    r = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2},    (39)

and in this case it is computed by

    r = \frac{\mathrm{Var}}{\mathrm{Var} + E(\mathrm{Var})}.    (40)
With 7^4 = 2401 θ points, each with 500 simulees, the computation time for the test reliability is long.

3.5. Evaluation Criteria

For each CAT item selection procedure, each pool, each test length, and each population, the time, test reliability, item usage, and absolute bias (ABSBIAS) for the four domain scores and the composite score AFQT are computed by averaging over the N = 1000 simulated examinees. ABSBIAS for the overall score is defined by

    \mathrm{ABSBIAS} = \frac{1}{N} \sum_{i=1}^{N} |\hat{\theta}_i - \theta_i|.

The CAT item selection procedures are compared by examining time, test reliability, item usage, ABSBIAS, and the correlations between the estimates and the true values for the four domains and the overall scores. For each of the 1000 examinees and each score of interest, the RMSE and BIAS over the ten replications are calculated for the purpose of examining the stability in selecting items and estimating the abilities. For each condition and each examinee, define

    \mathrm{BIAS} = \frac{\sum_{r=1}^{n} (f_r - f_{\mathrm{true}})}{n},

where n indicates the number of replications, and f_r and f_{\mathrm{true}} indicate the score estimate for the rth replication and the true value, respectively.
4. Results

4.1. Comparison of the Five Selection Methods

Results for approach (a) are presented first. For each of the 1000 true abilities, the estimated abilities are the average of the 10 replications. To examine the differences between the five selection procedures conditional on ability level, the BIAS for the four domains and overall scores was plotted against the true values for the condition Content = 0. Overall, the plots showed that Ag and Vm recovered the true values similarly for both domains and overalls; however, V1, V2, and KL behaved differently. For demonstration purposes, Figure 1 shows the results for AR, PC, and AFQT for population C2 and pool P2 with test length 55. V2 performed the poorest for domain scores, followed by V1 and KL. Both Ag and Vm delivered the best performance for the domain and overall scores. KL has a different pattern; some domains were not recovered well, and KL recovery of the overall scores was more similar to Ag than to V1. Examining the summary statistics for the 1000 simulees for the correlations and the ABSBIAS between the estimated ability and the true ability, the time, and the test reliability for the five selection procedures, it was found, as expected, that for all the conditions, as the test length increased, the time used, the test reliability, and the correlations increased, and the ABSBIAS decreased. The time used in selecting 18, 36, and 55 items had ranges of 0.01–0.1, 0.03–0.3, and
FIGURE 1. BIAS against the true values for the domain and overall scores for the five selection procedures for population C2 and P2 with test length = 55 without content restriction (Content = 0).
0.05–0.4 seconds, respectively. The test reliability for selecting 18, 36, and 55 items had ranges of 0.74–0.91, 0.86–0.96, and 0.91–0.98, respectively. As the Content changed from 0 or 1 to 2, the time used decreased for all the procedures. As the Content changed from 0 to 1 or 2, the correlations increased for all except Ag, the ABSBIAS decreased for all except Ag, and the test reliability decreased for all except Vm. The Content changing from 0 to 1 or 2 affected V2 the most, followed by V1 and KL; it did not affect Ag and Vm much. Figure 2 shows the results for Content = 1 for population C2 and pool P2. The examples of results provided in Figures 1 and 2 serve to illustrate the conclusions.

Several summary statistics are listed in Table 5 and Figure 3. Table 5 shows the results for the population C2 with P2 for the three test lengths and three content areas. Figure 3 shows test reliability, time, ABSBIAS, and the correlations between estimated ability and the true ability for the four domains and the composite for the population C1 for P2, in selecting 36 items, where the x-axis indicates the five CAT selection procedures crossed with each of the three content areas. For example, Ag0 represents method Ag with Content = 0. When Content = 0, the domain scores are poorly estimated, especially for PC with method V2, as there are not enough items in the domain being selected; the precision for the composite score AFQT is the criterion in selecting items. This issue is made more apparent in a later section discussing the item usage rate. When Content = 1 or Content = 2, V1, V2, Vm, and KL yield similar correlations and ABSBIAS for the four domains and overall scores, with V2 slightly better than the others (Figure 2, AFQT V2). Ag and Vm have well balanced item selections among domains,
even when Content = 0; the correlations for each of the domains for Content = 0 are similar to those for Content = 1 or 2.

For V2 approach (b), M = 0 and M = J/3 yield similar results, both in ability estimates and item usage. Therefore, results for M = 0 are presented. Table 6 shows the correlations for the four domains and the overall scores for the five procedures for approach (b) and test length = 55. It can be observed that V2 yields results similar to the other four procedures. The results for Content = 1 are slightly better than those for Content = 2. When using Content = 1, the items from each domain are chosen sequentially. For example, for a test of length 18, the first 5 items come from AR, the next 5 items from WK, the next 3 from PC, and, finally, the last 5 from MK.

4.2. Effect of Sample Population and Item Pool

Results for approach (a) are examined for the effect of the populations and the pool on the estimation of ability. Examining the summary statistics between P1 and P2 and among the four populations C1–C4, it was observed that the population distribution has some effect on the estimation of the domain scores and a large effect on the composite score. For populations with nonzero correlations between domains, the AFQT shows better results than those from populations with zero correlations (C2 versus C1; C4 versus C3); populations with mean zeros show better results than those from populations with nonzero means (C1 versus C3; C2 versus C4). The AFQT has high correlations (higher than the domains) for populations of nonzero correlation, but slightly lower correlations for populations of zero correlation. For C3 and C4, MK and PC have smaller correlations; the population means are far away from the means of
FIGURE 2. BIAS against the true values for the domain and overall scores for the five selection procedures for population C2 and P2 with test length = 55 when Content = 2.
the item difficulty. The two pools yield slightly different results. Overall, P2 yields slightly better results than P1, especially for PC. The correlations for PC for P2 are higher than those from P1. Figure 4 demonstrates the results. The plots in Figure 4 show the correlations for the four domains and the AFQT for P1 and P2 and for all four populations, where the x-axis represents the four populations and the two pools P1 and P2, with Content = 1 in selecting 55 items. In practice, when some content has item parameters of lower discriminations, P2 is recommended.
4.3. Item Pool Usage Rate

For approach (a), for each examinee, the selected items vary across the 10 replications. Figure 5 shows the percentage of items that are not selected for AR, WK, PC, MK, and combined, in selecting 55 items for all the conditions, where the x-axis represents the five procedures and the three content conditions. When Content = 1 and Content = 2, Ag has the largest number of items not being selected, followed by Vm. V1, V2, and KL have similar rates, with V2 having the smallest number of items not being selected, followed by KL and V1. It is interesting to see that V2 has the smallest number of unselected items for overall and for the domains,
TABLE 5.
Correlations, ABSBIAS, time, and test reliability for the five selection procedures for population C2 with P2 for the three test lengths.

                    Test length = 18         Test length = 36         Test length = 55
                    Content                  Content                  Content
Method              0      1      2          0      1      2          0      1      2

AFQT correlations
Ag                  0.98   0.97   0.96       0.99   0.99   0.99       0.99   0.99   0.99
Vm                  0.98   0.98   0.98       0.99   0.99   0.99       0.99   0.99   0.99
V1                  0.96   0.98   0.98       0.97   0.99   0.99       0.97   0.99   0.99
V2                  0.93   0.98   0.98       0.94   0.99   0.99       0.95   0.99   0.99
KL                  0.96   0.98   0.98       0.98   0.99   0.99       0.99   0.99   0.99

AFQT ABSBIAS
Ag                  0.23   0.27   0.29       0.16   0.18   0.20       0.13   0.15   0.16
Vm                  0.23   0.22   0.22       0.15   0.15   0.15       0.12   0.12   0.12
V1                  0.29   0.25   0.26       0.22   0.16   0.16       0.17   0.12   0.12
V2                  0.30   0.21   0.22       0.27   0.15   0.15       0.24   0.11   0.12
KL                  0.24   0.20   0.20       0.18   0.14   0.14       0.13   0.11   0.11

Time
Ag                  0.09   0.09   0.04       0.19   0.21   0.08       0.28   0.32   0.13
Vm                  0.09   0.10   0.02       0.20   0.22   0.05       0.32   0.33   0.08
V1                  0.06   0.07   0.01       0.12   0.14   0.03       0.19   0.21   0.05
V2                  0.06   0.07   0.01       0.12   0.14   0.03       0.18   0.21   0.05
KL                  0.08   0.08   0.02       0.15   0.15   0.04       0.20   0.23   0.06

Test reliability
Ag                  0.799  0.744  0.709      0.905  0.878  0.863      0.943  0.922  0.910
Vm                  0.803  0.811  0.808      0.906  0.912  0.911      0.943  0.945  0.945
V1                  0.859  0.812  0.801      0.934  0.911  0.909      0.958  0.945  0.944
V2                  0.905  0.813  0.800      0.959  0.912  0.909      0.975  0.945  0.944
KL                  0.868  0.823  0.821      0.940  0.917  0.918      0.962  0.950  0.949
except for PC for C1 and C3 (r = 0), even when Content = 0. When Content = 0, Ag and Vm have the smallest percentage of unselected items for PC; KL has the smallest percentage of unselected items for MK; V2 has the smallest percentage of unselected items for WK; and V1 has the smallest percentage of unselected items for MK. V1, V2, and KL have larger percentages for PC because PC items have lower discriminations than the other domains. Ag selects more PC items because each item in PC has low information. These results confirm the theorem and its corollaries. Populations and pools have slight effects on the item usage. P1 has a higher percentage of unselected items compared to P2. Table 7 lists the number of unselected items for C2 and P2. Overall, Content = 1 has a larger percentage of item usage compared to Content = 2. When Content = 1, V2 has the largest percentage of item usage for test lengths of 18 and 36. Table 8 lists the number of selected items and the ability estimates for the domains and overall for the five procedures for one examinee in population C2 for P2 in selecting 36 items when Content = 0, 1, and 2; the true values are also listed. The weights for V2, Ag, Vm, and KL are included in the table. When Content = 0, Ag and Vm have a more balanced item selection and, therefore, better domain estimates. However, for V1, V2, and KL, some domains do not have enough items; therefore, bad estimates are observed for those domains (e.g., PC). Clearly, when Content changes from 0 to 2, V2 improves significantly. V2, Vm, and KL have similar results, which are better than those for Ag and V1.
FIGURE 3. Test reliability, time, ABSBIAS, and correlations for AR, WK, PC, MK, and AFQT for the five CAT item selection procedures in selecting 36 items from P2 for population C1, with three content areas. Note: X-axis has a format "AB" which represents procedure "A" and Content = "B".
Table 9 compares the item usage rates for V2 for the two approaches (a) and (b) for P2; for the other four procedures, the selected items are the same under the two approaches. For approach (b), Content = 1 and 2 have similar item usage rates. For Content = 1 and for a test of length 55, the two approaches again have similar item usage rates; however, for a test of length 18 or 36, approach (b) has a much smaller item usage rate than approach (a). For Content = 2, the two approaches have similar item usage rates for all three test lengths. Please note that the populations have slight effects on the item usage rate.

4.4. G Method for P3

For P3, the CAT G method selects the next item favoring the general dimension. The 1000 true abilities are sampled from the standard seven-dimensional multivariate normal distribution. Table 10 lists the time, ABSBIAS, and correlations for the seven dimensions in selecting 18, 36, and 55 items. It can be seen that as the Content changes from 0 to 1 or 2, the correlations for the AFQT, MA, and AR decreased while the ABSBIAS increased. However, the opposite results are observed for VB, PC, and WK, where the correlations increased and the ABSBIAS decreased. When Content = 0, more items in AR and MK are selected, because their items have higher discriminations in the general dimension. When Content = 2, the time used is much smaller than for the other values of Content, and the correlations are higher than those for Content = 1 for all four domains and the AFQT, except for WK. The correlations for the AFQT had ranges of 0.93–0.94, 0.95–0.96, and 0.96–0.97 for test lengths of 18, 36, and 55, respectively. These correlations are similar to or smaller than those for the five methods. For the four domains and the other two
FIGURE 4. Correlations for AR, WK, PC, MK, and AFQT for the five item selection procedures in selecting 55 items from pools 1 and 2 for populations C1 to C4, with Content = 1. Note: PiCj = Pool i and population Cj, where i = 1, 2; j = 1, 2, 3, 4.
composites, MA and VB, the results for the G method dropped significantly compared to the other five methods. Because of time and technology constraints, only one replication for the G method was used; the correlations would increase slightly with 10 replications.

4.5. Comparing CAT and Paper-and-Pencil (PP) Format

Table 11 compares the correlations between the estimated and true abilities for this CAT study and a study of the MIRT PP format. The results for the PP are taken from Yao (2010a), for the population with correlations r = 0.5 between domains and a sample size of 2000. The two studies, CAT and PP, are from two different simulations; strictly speaking, they cannot be compared. However, this comparison can be helpful because the simulation conditions are
FIGURE 5. Rate of unselected items for AR, WK, PC, and AFQT for P1 and P2 in selecting 55 items for all the procedures. Note: X-axis has a format "AB" which represents procedure "A" and Content = "B".
similar. One can see that the results for CAT with a test length of 18 are much better than the results for PP of a test length of 60. This supports the claim that CAT can increase the precision or reduce test length.
5. Discussion

Many studies have been conducted regarding unidimensional CAT. However, more research on MCAT is needed. MCAT can reduce the test length and increase the precision of ability estimates compared to unidimensional CAT. Similar to the PP format, MCAT can report valid
TABLE 6.
Correlations for approach (b) for populations C1–C4 with P2 for test length = 55.

                          Content = 1                          Content = 2
Population  Method   AR     WK     PC     MK     AFQT     AR     WK     PC     MK     AFQT
C1          Ag      0.993  0.992  0.990  0.992  0.991    0.992  0.992  0.986  0.991  0.989
C1          Vm      0.995  0.994  0.990  0.995  0.993    0.995  0.994  0.990  0.995  0.994
C1          V1      0.996  0.995  0.991  0.995  0.993    0.995  0.995  0.990  0.995  0.993
C1          V2      0.995  0.995  0.990  0.995  0.994    0.996  0.994  0.990  0.995  0.994
C1          KL      0.996  0.995  0.990  0.995  0.994    0.995  0.995  0.990  0.995  0.994
C2          Ag      0.993  0.993  0.989  0.993  0.996    0.992  0.991  0.983  0.991  0.996
C2          Vm      0.995  0.995  0.989  0.996  0.997    0.995  0.995  0.989  0.996  0.997
C2          V1      0.995  0.995  0.990  0.996  0.997    0.995  0.995  0.989  0.996  0.997
C2          V2      0.995  0.995  0.989  0.996  0.997    0.995  0.995  0.990  0.996  0.997
C2          KL      0.995  0.995  0.990  0.996  0.998    0.995  0.995  0.991  0.996  0.998
C3          Ag      0.993  0.993  0.987  0.991  0.991    0.991  0.988  0.973  0.982  0.985
C3          Vm      0.995  0.995  0.986  0.992  0.992    0.994  0.995  0.987  0.992  0.992
C3          V1      0.995  0.995  0.987  0.992  0.992    0.995  0.995  0.987  0.992  0.991
C3          V2      0.995  0.995  0.986  0.992  0.992    0.995  0.995  0.985  0.992  0.992
C3          KL      0.995  0.995  0.989  0.992  0.993    0.995  0.995  0.988  0.992  0.993
C4          Ag      0.987  0.990  0.984  0.989  0.994    0.985  0.989  0.983  0.988  0.993
C4          Vm      0.989  0.994  0.984  0.989  0.995    0.988  0.994  0.984  0.990  0.995
C4          V1      0.989  0.994  0.984  0.989  0.995    0.989  0.995  0.985  0.991  0.995
C4          V2      0.990  0.995  0.984  0.990  0.995    0.989  0.995  0.984  0.991  0.995
C4          KL      0.991  0.995  0.985  0.992  0.996    0.990  0.994  0.982  0.992  0.996
TABLE 7.
Number of unselected items for the five selection procedures for population C2 with P2 for the three test lengths.

          Test length = 18       Test length = 36       Test length = 55
          Content                Content                Content
Method    0     1     2          0     1     2          0     1     2
Ag        759   759   821        569   632   708        409   579   620
Vm        783   783   840        641   625   715        530   551   555
V1        799   799   830        637   637   703        496   553   547
V2        604   649   830        394   598   703        270   555   545
KL        782   782   809        630   630   673        495   544   542
and reliable domain scores and overall scores, with far fewer items than unidimensional CAT and the MIRT PP test. This research considered five domain-specific item selection procedures by varying the population distribution of the examinees, test length, content consideration, and item pool structure. The procedures were compared by examining (1) ABSBIAS, (2) correlation, (3) time, (4) test reliability, and (5) item usage. For each condition, the results for the three content areas were compared. Two procedures, Ag and Vm, selected items balancing the information for all four domains, even when Content = 0. When Content = 0, the domain scores and the overall scores for Ag and Vm were estimated better than for the other three procedures; moreover, Ag had the largest percentage of item usage among all conditions. However, V1, V2, and KL did not work well when Content = 0; some domain scores were estimated poorly, and the selected items were not well balanced among the domains, especially for V2. Since there is no content restriction, V2 tends to select items in the dimensions that have larger weight. When Content changed from 0
TABLE 8.
Number of selected items and ability estimates for one examinee for population C2 and P2 in selecting 36 items when Content = 0, 1, and 2.

                                           Domain                                      Overall
Method                        AR           WK           PC           MK                AFQT
True ability                  0.52         −1.26        −0.67        −1.16             −0.17
(optimized weight)            (w1 = .575)  (w2 = .255)  (w3 = .103)  (w4 = .066)
Number of selected items
for Content = 1 and 2         10           10           6            10

Ag   Selected items (C = 0)   9            9            8            10
     Estimates (Content = 0)  0.37         −0.92        −0.62        −0.58             −0.12
     Estimates (Content = 1)  0.50         −0.95        −0.45        −0.76             −0.05
     Estimates (Content = 2)  0.58         −1.03        −0.33        −0.90             −0.02
Vm   Selected items (C = 0)   9            9            6            12
     Estimates (Content = 0)  0.51         −1.09        −0.51        −1.09             −0.10
     Estimates (Content = 1)  0.52         −0.97        −0.63        −1.16             −0.09
     Estimates (Content = 2)  0.55         −1.02        −0.52        −1.03             −0.06
V1   Selected items (C = 0)   8            2            4            22
     Estimates (Content = 0)  0.51         −0.69        −0.23        −1.06             −0.36
     Estimates (Content = 1)  0.48         −1.08        −0.61        −0.80             −0.50
     Estimates (Content = 2)  0.50         −0.97        −0.37        −0.94             −0.44
V2   Selected items (C = 0)   36           0            0            0
     Estimates (Content = 0)  0.48         0            0            0                 0.28
     Estimates (Content = 1)  0.48         −1.0         −0.64        −0.96             −0.11
     Estimates (Content = 2)  0.55         −1.02        −0.43        −1.09             −0.06
KL   Selected items (C = 0)   10           1            3            22
     Estimates (Content = 0)  0.45         −0.69        −0.27        −1.10             −0.02
     Estimates (Content = 1)  0.47         −1.29        −0.50        −1.02             −0.19
     Estimates (Content = 2)  0.56         −1.09        −0.51        −0.93             −0.07
TABLE 9.
Percentage of unselected items for V2 for the two approaches (a) and (b) for P2 for the three test lengths.

                   Approach (a)                   Approach (b)
Test length J   AR   WK   PC   MK   AFQT       AR   WK   PC   MK   AFQT
Content = 1
18              71   65   83   69   71         93   92   96   92   93
36              61   59   77   68   65         78   74   84   76   77
55              61   59   63   60   60         61   58   63   60   61
Content = 2
18              93   91   89   88   91         93   92   95   89   92
36              81   79   67   76   77         78   74   83   73   76
55              61   58   62   58   59         63   59   62   58   60
to Content = 1 or 2, Ag performed slightly worse, and Vm performed similarly, while V1, V2, and KL performed better for both domains and overalls. Content = 1 and Content = 2 yielded similar results, with Content = 1 having larger item usage and test reliability. When Content = 1, V1, V2, Vm, and KL produced similar results, with V2 having the largest item usage. For overall score recovery, V2 and KL had similar results, which were slightly better than those for V1 and Vm. If item usage is a major concern, then Ag without any content restriction (Content = 0) is the best choice.

The newly proposed procedure, V2, which selects items that minimize the error variance of the composite score with the optimized weight with content consideration, proved to be promising. For the prefixed optimized weight, for population C2, in selecting tests of lengths 18, 36, and 55, this procedure (1) used times of 0.07, 0.14, and 0.22 seconds, (2) had test reliabilities of 0.80, 0.91, and 0.94, (3) had ABSBIASes for the composite of 0.22, 0.15, and 0.12, (4) had correlations for the domains in the ranges 0.94–0.97, 0.97–0.99, and 0.99–0.99, (5) had correlations for the composite of 0.98, 0.99, and 0.99, and (6) selected 263, 314, and 357 items, respectively, out of a pool of size 912, for the 1000 simulees. In actual practice, the procedure V2 selects items using the optimized weight derived from the selected items and the current domain ability estimates; the number of items required for each domain should be specified (Content = 1 or 2). The results of this approach are comparable to results from the other procedures. Using the optimized weight for the overall score instead of a simple average is analogous to using item difficulty in IRT modeling instead of the classical number-correct score. It also has a higher item usage percentage compared to simple averaging. The item parameters in the pool can be obtained by fitting real data to the MIRT model following the test blueprint; the prior for the population is a multivariate normal with zero means and a variance-covariance matrix of correlations obtained from raw data. Examinees with higher correlations between the domain abilities have better precision in both the domain scores and the composite scores. Comparing the results of this study with results from similar simulation conditions for the PP format supports the claim that CAT increases the precision and reliability or reduces test length dramatically, especially for the overall score.

The G method, which selects items with higher discrimination for the general dimension, was also examined and compared with the other five procedures for both the estimation of domain scores and the estimation of overall scores. The results for the overall score are no better than those for the five procedures with content consideration (Content = 1 or 2); moreover, the results for the other composites and the domains for the G method are statistically worse than those for the five procedures.

In this study, the item pool contains items of simple structure. The computer programs, SimuMCAT (Yao, 2011), for the five procedures can be applied to pools containing complex structured items; they are available for free download at www.BMIRT.com. To be consistent,
TABLE 10.
ABSBIAS, correlations, and time for G method for P3 in selecting 18, 36, and 55 items.

                Length 18            Length 36            Length 55
Content         0     1     2        0     1     2        0     1     2
ABSBIAS
  AFQT          0.25  0.27  0.28     0.21  0.22  0.24     0.18  0.20  0.20
  MA            0.69  0.73  0.71     0.58  0.62  0.62     0.46  0.55  0.54
  VB            0.74  0.59  0.61     0.70  0.52  0.50     0.65  0.46  0.43
  AR            0.75  0.78  0.76     0.66  0.76  0.73     0.60  0.72  0.73
  WK            0.81  0.75  0.75     0.80  0.69  0.71     0.80  0.64  0.65
  PC            0.79  0.78  0.76     0.79  0.75  0.73     0.77  0.72  0.71
  MK            0.70  0.72  0.73     0.63  0.61  0.61     0.55  0.56  0.50
Correlations
  G             0.94  0.93  0.93     0.96  0.95  0.95     0.97  0.96  0.96
  MA            0.50  0.41  0.44     0.68  0.63  0.63     0.80  0.71  0.73
  VB            0.40  0.66  0.65     0.50  0.76  0.78     0.58  0.81  0.84
  AR            0.35  0.27  0.33     0.56  0.36  0.42     0.66  0.45  0.42
  WK            0.12  0.39  0.36     0.21  0.51  0.45     0.19  0.61  0.57
  PC            0.17  0.23  0.27     0.18  0.34  0.39     0.25  0.40  0.45
  MK            0.42  0.36  0.36     0.55  0.60  0.61     0.69  0.67  0.73
Time            0.29  0.31  0.08     0.60  0.68  0.17     0.93  1.07  0.26
TABLE 11.
Correlations between the estimated and true abilities for domains and overall for CAT and Paper-and-Pencil format.

CAT: V2, P2, and population C2 for tests of lengths 18, 36, and 55
Length                                        18     36     55
AR                                            0.96   0.99   0.99
WK                                            0.97   0.99   0.99
PC                                            0.94   0.97   0.99
MK                                            0.97   0.99   0.99
AFQT                                          0.97   0.99   0.99

Paper and pencil: population of r = 0.5 and sample size 2000
Length                                        20     32     48     60
Number sense and computational techniques     0.83   0.87   0.89   0.91
Algebra, patterns, and functions              0.81   0.87   0.89   0.91
Statistics and probability                    0.78   0.85   0.89   0.89
Geometry and measurement                      0.82   0.85   0.87   0.91
MA                                            0.88   0.92   0.93   0.95
To be consistent, the mathematical theories are derived for items of simple structure, and the simulation study and the theoretical proofs yield consistent conclusions. Further research, in terms of both simulation studies and mathematical theory, should be conducted for item pools containing items of both simple and complex structure. However, experience has shown that the procedures tend to select items of essentially simple structure, and the mathematical theories are conjectured to be similar, requiring only additional matrix manipulations; these manipulations amount to changing the coordinates of the theta space through linking. In future studies, the five procedures should be compared under item-exposure control, and the content requirements could be accommodated by extending the Priority Index (Cheng & Chang, 2009) from unidimensional IRT to MIRT. Although the Bayesian procedure was used here, the method and software work for both Bayesian and non-Bayesian estimation, as specified by the user.
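For reference, a compact derivation of the optimized composite weight is given below. It is a sketch consistent with the minimum-error-variance idea discussed above, not a transcription of the paper's proofs; the notation (composite $w^\top\theta$, posterior covariance $V = B^{-1}$, weights constrained to sum to one) is assumed here. Minimizing the composite error variance

$$\min_{w \,:\, \mathbf{1}^\top w = 1} \; w^\top V w$$

with a Lagrange multiplier gives

$$w^{*} \;=\; \frac{V^{-1}\mathbf{1}}{\mathbf{1}^\top V^{-1}\mathbf{1}} \;=\; \frac{B\,\mathbf{1}}{\mathbf{1}^\top B\,\mathbf{1}},
\qquad
w^{*\top} V\, w^{*} \;=\; \frac{1}{\mathbf{1}^\top B\,\mathbf{1}},$$

so that, under these assumptions, selecting the item that minimizes the optimized-weight error variance is equivalent to selecting the item that maximizes the total sum of the entries of the updated posterior information matrix $B$.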
Acknowledgements

I would like to thank the reviewers and the editor for their valuable input on the earlier version of this manuscript. I would also like to thank Dan Segall for his helpful comments, and Sherlyn Stahr, Louis Roussos, and my daughter Sophie Chen for their editorial assistance. The views expressed are those of the author and not necessarily those of the Department of Defense or the United States government.

References

Chang, H.-H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213–229.
Cheng, Y., & Chang, H.-H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62, 369–383.
De la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size: a higher-order IRT approach. Applied Psychological Measurement, 34, 267–285.
Haberman, S.J., & Sinharay, S. (2010). Reporting of subscores using multidimensional item response theory. Psychometrika, 75, 331–354.
Lee, Y.H., Ip, E.H., & Fuh, C.D. (2008). A strategy for controlling item exposure in multidimensional computerized adaptive testing. Educational and Psychological Measurement, 68, 215–232.
Li, Y.H., & Schafer, W.D. (2005). Trait parameter recovery using multidimensional computerized adaptive testing in reading and mathematics. Applied Psychological Measurement, 29, 3–25.
Luecht, R.M. (1996). Multidimensional computerized adaptive testing in a certification or licensure context. Applied Psychological Measurement, 20, 389–404.
Luecht, R.M., & Miller, T.R. (1992). Unidimensional calibrations and interpretations of composite traits for multidimensional tests. Applied Psychological Measurement, 16, 279–293.
Mulder, J., & van der Linden, W.J. (2009). Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika, 74, 273–296.
Reckase, M.D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25–36.
Reckase, M.D. (2009). Multidimensional item response theory. New York: Springer.
Reckase, M.D., & McKinley, R.L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15, 361–373.
Segall, D.O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331–354.
Segall, D.O. (2001). General ability measurement: an application of multidimensional item response theory. Psychometrika, 66, 79–97.
van der Linden, W.J. (1999). Multidimensional adaptive testing with a minimum error-variance criterion. Journal of Educational and Behavioral Statistics, 24, 398–412.
Veldkamp, B.P., & van der Linden, W.J. (2002). Multidimensional adaptive testing with constraints on test content. Psychometrika, 67, 575–588.
Wang, C., Chang, H.-H., & Boughton, K.A. (2011). Kullback–Leibler information and its applications in multidimensional adaptive testing. Psychometrika, 76, 13–39.
Yao, L. (2003). BMIRT: Bayesian multivariate item response theory [Computer software]. Monterey: Defense Manpower Data Center.
Yao, L. (2010a). Reporting valid and reliable overall scores and domain scores. Journal of Educational Measurement, 47, 339–360.
Yao, L. (2010b). Multidimensional ability estimation: Bayesian or non-Bayesian. Unpublished manuscript.
Yao, L. (2011). SimuMCAT: simulation of multidimensional computer adaptive testing [Computer software]. Monterey: Defense Manpower Data Center.
Yao, L., & Boughton, K.A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31, 83–105.
Yao, L., & Boughton, K.A. (2009). Multidimensional linking for tests containing polytomous items. Journal of Educational Measurement, 46, 177–197.
Yao, L., & Schwarz, R.D. (2006). A multidimensional partial credit model with associated item and test statistics: an application to mixed-format tests. Applied Psychological Measurement, 30, 469–492.

Manuscript Received: 3 FEB 2011
Final Version Received: 9 SEP 2011