Conditional Item-Exposure Control in Adaptive Testing Using Item-Ineligibility Probabilities

Journal of Educational and Behavioral Statistics, December 2007, Vol. 32, No. 4, pp. 398-418. DOI: 10.3102/1076998606298044. © 2007 AERA and ASA. http://jebs.aera.net

Conditional Item-Exposure Control in Adaptive Testing Using Item-Ineligibility Probabilities

Wim J. van der Linden
Bernard P. Veldkamp
University of Twente

Two conditional versions of the exposure-control method with item-ineligibility constraints for adaptive testing in van der Linden and Veldkamp (2004) are presented. The first version is for unconstrained item selection, the second for item selection with content constraints imposed by the shadow-test approach. In both versions, the exposure rates of the items are controlled using probabilities of item ineligibility given θ that adapt the exposure rates automatically to a goal value for the items in the pool. In an extensive empirical study with an adaptive version of the Law School Admission Test, the authors show how the method can be used to drive conditional exposure rates below goal values as low as 0.025. Obviously, the price to be paid for minimal exposure rates is a decrease in the accuracy of the ability estimates. This trend is illustrated with empirical data.

Keywords: adaptive testing; conditional item-exposure control; item eligibility method; uniform exposure rates

Items in adaptive tests are selected as a solution to an optimization problem in which an objective function is maximized over the item pool. If the test has to meet a set of content specifications, the optimization becomes an instance of a more complicated constrained combinatorial optimization problem. A popular choice for the objective function in these problems is the value of the information function at the current ability estimate for the items in the pool. Suppose the items have been calibrated using the three-parameter logistic (3-PL) model

  p_i(\theta) \equiv \Pr\{U_{ij} = 1\} = c_i + (1 - c_i) \frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]},    (1)

where \theta \in (-\infty, \infty) is a parameter representing the ability of the test taker and b_i \in (-\infty, \infty), a_i > 0, and c_i \in [0, 1] are the difficulty, discriminating power,

The authors are grateful to Wim M. M. Tielen for his computational assistance.



FIGURE 1. Dominance curves for Levels 1, 2, 3, 4, 5, 10, 20, . . . , 100 for an item pool from the Law School Admission Test. Note: Each curve is composed of the segments of the information functions of the items that dominate the other items at the θ values.

and guessing parameter for item i = 1, . . . , N in the pool, respectively (Birnbaum, 1968). For this model, the item information function is

  I_i(\theta) = \frac{[p_i'(\theta)]^2}{p_i(\theta)[1 - p_i(\theta)]}.    (2)
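As a concrete illustration, the response function in Equation 1 and the information function in Equation 2 can be sketched in code. The item parameters below are hypothetical, not LSAT values, and `best_item` is only a stand-in for the maximum-information selection criterion described above.

```python
import math

def p_3pl(theta, a, b, c):
    """Equation 1: 3PL probability of a correct response."""
    e = math.exp(a * (theta - b))
    return c + (1.0 - c) * e / (1.0 + e)

def info_3pl(theta, a, b, c):
    """Equation 2: item information [p'(theta)]^2 / (p(theta)(1 - p(theta)))."""
    e = math.exp(a * (theta - b))
    p = p_3pl(theta, a, b, c)
    dp = (1.0 - c) * a * e / (1.0 + e) ** 2  # derivative of Equation 1
    return dp ** 2 / (p * (1.0 - p))

def best_item(items, theta):
    """Index of the item with maximal information at the ability estimate."""
    return max(range(len(items)), key=lambda i: info_3pl(theta, *items[i]))

# Hypothetical mini-pool of (a_i, b_i, c_i) parameter triples
pool = [(1.2, -1.0, .20), (0.8, 0.0, .15), (1.5, 0.5, .25), (1.0, 1.2, .20)]
choices = [best_item(pool, th) for th in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

Ranking the whole pool by `info_3pl` at each θ, instead of taking only the maximum, traces out the Level 1, Level 2, . . . dominance curves discussed below.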

Let g = 1, . . . , n denote the items in the adaptive test and θ̂^{(g-1)} the estimate of θ after the first g - 1 items. If the gth item is selected, the objective function that is maximized over the items in the pool is the information in Equation 2 at θ = θ̂^{(g-1)}. Because we optimize, the items in the test tend to be picked only from a small subset of the pool. This point is illustrated by the topmost curve in Figure 1, which is composed of the segments of the information functions in Equation 2 that are locally best among the items in a pool for a section from the Law School Admission Test (LSAT). We refer to this curve as the Level 1 dominance curve for the item pool. The small subset of items on which this curve is based (in this case, only 6 items from a pool of 305) dominates the other items in the pool everywhere over the interval; consequently, these items are always chosen. The basic message from this curve is that unless special precautions are taken, the majority of the items in the pool are bound to remain inactive.

If the adaptive test is administered in a continuous testing program, a set of dominant items is easily memorized. If the stakes for the test takers are high, it thus is possible for a few of them to plot and identify a critical portion of the

item pool. Subsequent test takers are then able to familiarize themselves with these items and increase their test scores.

Fortunately, selecting items below the Level 1 dominance set does not necessarily involve a large loss. The second curve in Figure 1 shows the Level 2 dominance curve for the same item pool, that is, the curve consisting of the segments of the information functions for the locally second-best items in the pool. Because the curves for the two levels hardly differ in height (except at the upper end of the scale, where the pool had a few strongly discriminating items), not much accuracy of scoring would be lost if we relaxed the criterion for item selection somewhat and selected items from both subsets.

This fact has led to the idea of probabilistic control of item exposure in adaptive testing. The first probabilistic method was proposed by McBride and Martin (1983). Their method simply consisted of picking the items randomly from the first five levels of dominance along the θ scale (i.e., the first five curves in Figure 1). The fact that the method is probabilistic is important because it treats all test takers at a given ability level equally fairly, in the sense that each of them has the same probability of getting each item.

A more advanced probabilistic method was proposed by Sympson and Hetter (1985; see also Hetter & Sympson, 1997). This method, which will be discussed in more detail in the next section, is based on a probabilistic experiment that is conducted each time an item is selected. The outcome of this experiment is either the decision to go on and administer the item or to pass and rerun the experiment for the next best item in the pool. The conditional probabilities of administration given the selection of an item are the control parameters used to restrict the item-exposure rates. The values of these parameters are to be set through an iterative process of simulated adaptive test administrations.
An alternative method of probabilistic exposure control was presented in van der Linden and Veldkamp (2004). The probability experiment in this method differs from that in the Sympson-Hetter method in the following three aspects. First, the experiment is not conducted each time after an item is selected but only once, before a test taker begins the test. Second, the critical parameters in the experiment are not the conditional probabilities of administering an item given that it has been selected but the probabilities of the items being eligible for the test taker. If an item is eligible, it is available for administration to the test taker. If the item is ineligible, it is removed from the pool for the test taker. Third, the probabilities of ineligibility are used in a recurrence relation that allows them to adapt automatically to appropriate levels during testing. The differences between the two methods will become precise when we discuss them in more detail in the following.

The present article serves three different goals. Our first goal is to generalize the item-ineligibility method, which was developed for use with constrained item selection through the shadow-test method, to any type of item selection in adaptive testing. The generalization makes the basic structure of the method transparent and allows us to highlight some of its features. Our second goal is to

formulate a version of the method for conditional item-exposure control given θ. Conditional control is generally accepted as more effective because it reduces the likelihood that test takers of approximately the same ability level detect the subset of items in the pool specific to them (Stocking & Lewis, 1998). As will become clear in the following, to formulate a conditional version of the method, we have to reconceptualize the adaptive test as one conducted from multiple item pools. Our final goal is to explore the behavior of the new versions of these methods when the exposure rates of the items are driven to their minimum. This appears to be possible provided we are willing to pay the price of a decrease in the accuracy of the ability estimates. The Level 10 through Level 100 curves in Figure 1 explain the decrease in accuracy: If the exposure rates are set lower and lower, we are required to select items from the lower dominance levels in the pool and eventually have to accept considerable loss of accuracy in testing. In addition, if content constraints are imposed on the test, lowering the exposure rates may eventually lead to overconstraining of the item selection, namely, the case where no feasible test is left in the pool. Thus, if the conditional item-exposure rates are chosen to be too low, the test becomes ineffective at some of the ability levels.

Sympson-Hetter Method

To highlight the differences with the ineligibility method in the next sections, we briefly discuss the Sympson-Hetter (hereafter SH) method, which is based on the following two events for each item in the pool:

S_i: item i is selected;
A_i: item i is administered.

Because an item can never be administered without being selected, it always holds that

  A_i \subset S_i    (3)

for all i. Hence, for the conditional exposure rate of item i given θ (i.e., the probability P(A_i | θ)), it follows that

  P(A_i \mid \theta) = P(A_i, S_i \mid \theta) = P(A_i \mid S_i, \theta)\, P(S_i \mid \theta)    (4)

for all possible values of θ. The SH method is used to force the item-exposure rates of all items in the pool below an upper bound r_max. Typically, r_max is chosen to be in the range of .20 to .30. From Equations 3 and 4 it follows that the bound is realized when

  P(A_i \mid S_i, \theta)\, P(S_i \mid \theta) \le r_{\max}    (5)

for i = 1, . . . , N.

The probability of item selection, P(S_i | θ), depends on a variety of factors, including the distribution of the item parameters in the pool, the objective function that is optimized, and the initial item that is chosen. Because each of these factors is fixed by design, the only parameters left in Equation 5 to manipulate the exposure rates are the conditional probabilities P(A_i | S_i, θ). It is always possible to meet the bound in Equation 5 by setting the probabilities P(A_i | S_i, θ) at low values for all items. However, implicit in Equation 4 is the idea that the exposure rates for the best items in the pool should not be much lower than r_max because we do not want to lose them entirely. In other words, r_max should be viewed as a goal value that has to be approached from below rather than just an upper bound.

Values for P(A_i | S_i, θ) that approach the goal value cannot be found by an analytic method. Sympson and Hetter therefore proposed to find them using an iterative process of computer simulations that has to be continued until admissible exposure rates are found. At each step in this process, a large set of simulated test administrations is replicated at the θ values for which the exposure rates have to be controlled. At the end of the step, the new conditional exposure rates of the items given these values are estimated and the control parameters are adjusted. The fact that the control parameters are set at the true θ values, whereas in operational testing we control the exposure rates at estimated θ values, is not much of a problem provided the set of θ values is well chosen (Stocking & Lewis, 2000).

Let t = 1, 2, . . . denote the sets of simulations. The adjustment rule used in the SH method is

  P^{(t+1)}(A_i \mid S_i, \theta) :=
  \begin{cases}
  1 & \text{if } P^{(t)}(S_i \mid \theta) \le r_{\max}, \\
  r_{\max} / P^{(t)}(S_i \mid \theta) & \text{if } P^{(t)}(S_i \mid \theta) > r_{\max},
  \end{cases}    (6)
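Read together with the condition on i that follows, the adjustment rule in Equation 6 can be sketched in code; the numbers in the usage lines are hypothetical, not values from the study.

```python
def sh_adjust(p_select, r_max):
    """Equation 6: new SH control parameter P(A_i | S_i, theta), given the
    selection probability estimated in the last round of simulations."""
    if p_select <= r_max:
        return 1.0           # item not overexposed: no control needed
    return r_max / p_select  # damp administration so P(A_i | theta) = r_max

# An item selected 60% of the time at some theta, with goal value .25:
ctrl = sh_adjust(0.60, 0.25)   # administer on .25/.60 of its selections
exposure = ctrl * 0.60         # resulting exposure rate equals r_max
```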

where i = 1, . . . , N. The rule is based on the idea that if at step t an item was selected with a probability smaller than r_max, no control is needed. However, if an item was selected with a conditional probability larger than r_max, its control parameter should have been set such that P(A_i | θ) = r_max. From Equation 4 it follows that this requirement would have been realized if P^{(t)}(A_i | S_i, θ) had been equal to r_max / P^{(t)}(S_i | θ). Hence, the new value of P^{(t+1)}(A_i | S_i, θ) is set at this level. For a more formal treatment of the SH method and some alternative methods based on variations of this adjustment rule, the reader should consult van der Linden (2003).

In practical settings, the use of the SH method has been found to be time-consuming. The number of θ values at which the exposure rates are controlled is usually in the range of 10 to 12. The number of iterations of the adaptive test simulations for one θ value is generally of the same order. It is therefore not uncommon to have to run 100 to 200 simulations before an admissible set of exposure rates is found. In addition, if some of the items have to be replaced

during operational testing because they are compromised or appear to be flawed, the values of the control parameters become invalid and the procedure has to be repeated (Chang & Harris, 2002).

A more fundamental problem with the SH method is the fact that the item-exposure rates do not necessarily converge to values below r_max during the adjustment process. It can be regularly observed that the rates of some of the overexposed items increase rather than decrease after adjustment. Also, rates that were below r_max for some steps may suddenly jump back to a value larger than this target. Because of this behavior, it is necessary to inspect the exposure rates of all items after each step and use personal judgment to decide when to stop.

Conditional Item-Ineligibility Methods

We first discuss the case of adaptive testing without content constraints. The modifications of the method necessary to deal with adaptive testing with content constraints are introduced in a later section.

Testing Without Content Constraints

To formulate the item-ineligibility method, we consider the following two events:

E_i: item i is eligible;
A_i: item i is administered.

If the item is eligible, it remains in the pool during the entire test for the test taker; otherwise, it is removed. Unlike the SH method, it is not necessary to allow for an event S_i of selecting item i: an item is always administered if it is selected. More formally, it holds that A_i = S_i, and we need not consider the latter. Analogous to Equation 3,

  A_i \subset E_i    (7)

for all i. We are therefore able to write

  P(A_i \mid \theta) = P(A_i, E_i \mid \theta) = P(A_i \mid E_i, \theta)\, P(E_i \mid \theta)    (8)

for all possible values of θ. If we impose r_max as a goal value for the exposure rates P(A_i | θ), we obtain

  P(A_i \mid \theta) = P(A_i \mid E_i, \theta)\, P(E_i \mid \theta) \le r_{\max},    (9)


or

  P(E_i \mid \theta) \le \frac{r_{\max}}{P(A_i \mid E_i, \theta)},    (10)

with P(A_i | E_i, θ) > 0. From Equation 7,

  P(A_i \mid E_i, \theta) = \frac{P(A_i \mid \theta)}{P(E_i \mid \theta)},    (11)

with P(E_i | θ) > 0. Hence, Equation 10 can be rewritten as

  P(E_i \mid \theta) \le \frac{r_{\max}}{P(A_i \mid \theta)}\, P(E_i \mid \theta),    (12)

still with P(A_i | θ) > 0. The basic idea is to conceive of Equation 12 as a recurrence relation. Suppose j test takers have already taken the test, and we want to establish the probabilities of eligibility for test taker j + 1. If r_max is our goal value for the exposure rates at selected points θ_k, k = 1, . . . , K, the probabilities of eligibility P^{(j+1)}(E_i | θ_k) can be calculated as

  P^{(j+1)}(E_i \mid \theta_k) = \min\left\{ \frac{r_{\max}}{P^{(j)}(A_i \mid \theta_k)}\, P^{(j)}(E_i \mid \theta_k),\; 1 \right\},    (13)

with P^{(j)}(A_i | θ_k) > 0. The rationale for the recurrence relation in Equation 13 is that it automatically maintains r_max as a goal value for the exposure rates. It is easy to show that

  if P^{(j)}(A_i | θ_k) > r_max, then P^{(j+1)}(E_i | θ_k) < P^{(j)}(E_i | θ_k);
  if P^{(j)}(A_i | θ_k) = r_max, then P^{(j+1)}(E_i | θ_k) = P^{(j)}(E_i | θ_k);
  if P^{(j)}(A_i | θ_k) < r_max, then P^{(j+1)}(E_i | θ_k) > P^{(j)}(E_i | θ_k).    (14)

Thus, if an exposure rate is larger than the goal value, the probability of eligibility of the item always goes down. As a result, because of Equation 7, the expected exposure rate also goes down. On the other hand, if an exposure rate is below r_max, the probability of eligibility of the item goes up.

Practical Implementation

To implement the method for exposure control conditional on a set of values θ_k, k = 1, . . . , K, we have to abandon the idea of an adaptive test from a single item pool. Before a person takes the test, the current probabilities of eligibility are used to decide which items are eligible for each of the values θ_k. The result

of these experiments is K different versions of the item pool, one at each θ_k. During the test, the person visits the version of the item pool that is closest to his or her current ability estimate.

To estimate the probabilities of eligibility, we have to record the following two counts:

a_ijk: number of test takers through j who visited item pool k and took item i;
e_ijk: number of test takers through j who visited item pool k when item i was eligible.

P^{(j)}(A_i | θ_k) and P^{(j)}(E_i | θ_k) can then be estimated as a_ijk / j and e_ijk / j, and the estimated probability of eligibility for test taker j + 1 in Equation 13 is obtained as

  \hat{P}^{(j+1)}(E_i \mid \theta_k) = \min\left\{ \frac{r_{\max}\, e_{ijk}}{a_{ijk}},\; 1 \right\},    (15)

with a_ijk > 0. The estimates in Equation 15 ignore the differences between the test taker's true and estimated ability. Just as for the SH method, we expect the impact of estimation error on the actual exposure rates to be negligible for all practical purposes (Stocking & Lewis, 2000). The simulation results presented later in this article confirm this expectation.

Two different implementations of the method are possible:

1. The method can be used to control the exposure rates on the fly, that is, without any prior adjustment of the eligibility probabilities.
2. The probabilities can be adjusted prior to operational testing through a computer simulation of administrations of the test.
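The on-the-fly implementation can be sketched as follows. The simulation is deliberately minimal: a one-item "test" per taker at a single θ_k, with selection of the lowest-indexed eligible item standing in for maximum-information selection, and a fallback to the full pool when no item is eligible. It only illustrates how the count-based update in Equation 15 pushes realized exposure rates toward the goal value.

```python
import random

def eligibility_update(r_max, e_count, a_count):
    """Equation 15: eligibility probability for the next test taker."""
    if a_count == 0:
        return 1.0  # item never administered yet: keep it fully eligible
    return min(r_max * e_count / a_count, 1.0)

def simulate(n_items, n_takers, r_max, seed=1):
    rng = random.Random(seed)
    e_prob = [1.0] * n_items     # current P(E_i | theta_k)
    e_cnt = [0] * n_items        # e_ijk: times eligible
    a_cnt = [0] * n_items        # a_ijk: times administered
    for _ in range(n_takers):
        # Bernoulli eligibility experiment, run once before the test
        eligible = [i for i in range(n_items) if rng.random() < e_prob[i]]
        if not eligible:         # empty pool: fall back to the full pool
            eligible = list(range(n_items))
        chosen = eligible[0]     # stand-in for the best eligible item
        a_cnt[chosen] += 1
        for i in eligible:
            e_cnt[i] += 1
        e_prob = [eligibility_update(r_max, e_cnt[i], a_cnt[i])
                  for i in range(n_items)]
    return [a / n_takers for a in a_cnt]   # realized exposure rates

rates = simulate(n_items=10, n_takers=2000, r_max=0.25)
```

With one item administered per taker the rates sum to 1, and the most heavily used items settle near the goal value of .25 rather than dominating the pool.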

In the previous study with the unconditional version of the method (van der Linden & Veldkamp, 2004), we were able to report that for a typical goal value of r_max = .25, both the probabilities of eligibility and the exposure rates were already stable after 1,000 test administrations. In the empirical study reported later in this article, we minimized the goal values and found that for values close to their minimum (see Equation 25 in the following), the method should be used in combination with the technique of fading to obtain stability after the same number of administrations. (The technique of fading is explained in a following section.) Whatever implementation is used, it is always possible to deal on the fly with the replacement of a few items in the operational pool, for instance, because they appear to be compromised.

Testing With Content Constraints

Usually, content constraints are to be imposed on the adaptive test. If so, the shadow-test approach offers an effective implementation. Shadow tests are full-size tests calculated prior to the selection of the items that (a) are optimal at the

last ability estimate, (b) meet all content constraints, and (c) include all items already administered to the current person. The next item to be administered is the best item among the free items in the current shadow test. Shadow tests can be easily assembled using the technique of 0-1 integer programming (van der Linden, 2000, 2005; van der Linden & Reese, 1998).

A natural way to implement the control in adaptive testing with content constraints is through the inclusion of ineligibility constraints in the models for the shadow tests. If the decision is that item i is ineligible for the test taker, the following constraint is added to the model:

  x_i = 0,    (16)

where x_i is the 0-1 decision variable for item i in the model (that is, item i is selected if x_i = 1 but not if x_i = 0). If item i remains eligible, no constraint is added.

The extension of the test-assembly model for the shadow test with these ineligibility constraints gives rise to two new issues. The first is potential overconstraining of the item selection. In principle, it is possible that a temporary combination of content and ineligibility constraints in the model is unfortunate and no feasible solution is left. Generally, however, for a typical testing program, the number of ineligibility constraints in the model is small, and the likelihood of an infeasible solution can be ignored. In fact, in our empirical studies with the unconditional version of this method, infeasibility never occurred (van der Linden & Veldkamp, 2004). We expect the same to happen with the conditional version of the method except when the exposure rates are driven to their minimum (see the next section). Also, the likelihood of an infeasibility depends entirely on the appropriateness of the item pool for the test. A method of item-pool assembly that guarantees a balanced distribution of the items in the pool with respect to the content constraints is presented in van der Linden, Ariel, and Veldkamp (2006). For methods of item-pool design that guarantee such distributions, see van der Linden (2005). If infeasibility occurs, a straightforward solution is to remove all ineligibility constraints from the model and use the full pool for item selection. This measure may lead to an occasional extra exposure for some of the items, but the adaptive mechanism in Equation 14 automatically corrects for them.

The second issue has to do with the fact that the use of shadow tests reinvokes the distinction between item selection and administration on which the SH method rests.
An item can now be selected for the shadow test but may not be administered because it was dominated by the other free items in the test. To deal with these new issues, we distinguish the following possible events:

E_i: item i is eligible;
F: the model for the shadow test with the ineligibility constraints is feasible;
S_i: item i is selected in a shadow test;
A_i: item i is administered.

It holds for these four events that

  A_i \subset S_i \subset \{E_i \cup \bar{F}\},    (17)

where \bar{F} is the event of an infeasible shadow test. Following the same argument as in van der Linden and Veldkamp (2004, Equations 8 through 13), an upper bound r_max on the conditional exposure rates P(A_i | θ) can be shown to lead to

  P(E_i \mid \theta) \le 1 - \frac{1}{P(F \mid \theta)} + \frac{r_{\max}\, P(E_i \cup \bar{F} \mid \theta)}{P(A_i \mid \theta)\, P(F \mid \theta)},    (18)

with P(A_i | θ) > 0 and P(F | θ) > 0. This inequality implies the following version of Equation 13 for the case of adaptive testing with content constraints:

  P^{(j+1)}(E_i \mid \theta_k) = \min\left\{ 1 - \frac{1}{P^{(j)}(F \mid \theta_k)} + \frac{r_{\max}\, P^{(j)}(E_i \cup \bar{F} \mid \theta_k)}{P^{(j)}(A_i \mid \theta_k)\, P^{(j)}(F \mid \theta_k)},\; 1 \right\},    (19)

still with P^{(j)}(A_i | θ_k) > 0 and P^{(j)}(F | θ_k) > 0. This equation looks more complicated than Equation 13, but the relation between the two becomes clear if we consider the case in which the shadow test is always feasible and set P^{(j)}(F | θ_k) = 1. It then holds that P^{(j)}(E_i ∪ F̄ | θ_k) = P^{(j)}(E_i | θ_k), and Equation 13 directly follows from Equation 19. In addition, Equations 13 and 19 have the same adaptive behavior. The only difference between the two relations is a correction of the probability of item eligibility for possible infeasibility in Equation 19. For these and other details, see van der Linden and Veldkamp (2004).

To estimate the right-hand probabilities in Equation 19, we need the following counts:

n_jk: number of test takers through j who visited item pool k;
a_ijk: number of test takers through j who visited item pool k and took item i;
φ_jk: number of test takers through j who visited item pool k when the shadow test was feasible;
r_ijk: number of test takers through j who visited item pool k when item i was eligible or the shadow test was infeasible.

The left-hand probabilities are then estimated as

  \hat{P}^{(j+1)}(E_i \mid \theta_k) = \min\left\{ 1 - \frac{n_{jk}}{\varphi_{jk}} + \frac{r_{\max}\, n_{jk}\, r_{ijk}}{a_{ijk}\, \varphi_{jk}},\; 1 \right\},    (20)

with a_ijk > 0 and φ_jk > 0.
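A sketch of the update in Equation 20, using the counts just defined; the feasibility count (whose symbol is partly garbled in the source text) is written f_jk here, and the numbers in the usage line are hypothetical. In the always-feasible case the formula reduces to Equation 15.

```python
def eligibility_update_constrained(r_max, n_jk, a_ijk, f_jk, r_ijk):
    """Equation 20: eligibility probability for the next test taker when
    shadow-test infeasibility is possible.
    n_jk  -- takers who visited pool k
    a_ijk -- takers who took item i in pool k
    f_jk  -- takers for whom the shadow-test model was feasible
    r_ijk -- takers for whom item i was eligible or the model infeasible"""
    value = 1.0 - n_jk / f_jk + (r_max * n_jk * r_ijk) / (a_ijk * f_jk)
    return min(value, 1.0)

# Always-feasible case (f_jk = n_jk, r_ijk = e_ijk): reduces to
# Equation 15, min{r_max * e_ijk / a_ijk, 1}
p = eligibility_update_constrained(0.25, 1000, 300, 1000, 900)
```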

To begin the test, it is recommended to set \hat{P}^{(j+1)}(E_i | θ_k) = 1 for item i until both a_ijk > 0 and φ_jk > 0. This initialization helps us prevent indeterminate values before the conditions in Equation 20 are satisfied.

Fading

In van der Linden and Veldkamp (2004), it is recommended to update the counts using the technique of fading, which is used in applications of Bayesian networks for updating posterior probabilities. The basic idea underlying this technique is to weight the old data by a fading factor when new data are added. As a result, the effective size of the sample remains fixed at a predetermined level (Jensen, 2001), and the probabilities of eligibility have the same level of stability throughout operational testing. For a demonstration of the effectiveness of fading for updating eligibility probabilities, see van der Linden and Veldkamp (2004).

Suppose we use a fading factor w. The number of test takers n_jk who visited item pool k in Equation 20 is no longer a direct count but a number updated as

  n^*_{(j+1)k} = w\, n^*_{jk} + 1.    (21)

The updates of the other counts are analogous. For example, the number of test takers who visited pool k and received item i is updated as

  a^*_{(j+1)ik} =
  \begin{cases}
  w\, a^*_{ijk} + 1 & \text{if item } i \text{ was administered to test taker } j, \\
  w\, a^*_{ijk} & \text{otherwise.}
  \end{cases}    (22)

These updates produce estimates of the probabilities based on a moving window with weights close to one for recent test takers but approaching zero for earlier test takers. The method can be shown to yield estimates based on an effective sample size equal to 1/(1 - w) (Jensen, 2001). In the following empirical study, we used w = .999, which amounts to an effective sample size of 1,000.

The use of fading is particularly important when the goal value for the exposure rates is set close to its minimum (see the next section). Typically, the probabilities of eligibility for the item pool go down to a value smaller than 1 in an order determined by the level of dominance of the items (see Figure 1). When the goal value approaches its minimum, the process has to be continued until the last items in the pool are reached. But when these items become active, the numbers of test takers who have visited the item pools, n_jk, have already grown large. As a result, Equation 20 adapts only slowly to the changes in the counts of the item administrations, a_ijk, which had been close to zero thus far. The technique of fading prevents this slower adaptation because the weighted counts in Equation 20 are based on the same effective number of test takers for later items as for earlier ones.

Minimizing Exposure Rates

It is tempting to explore how low r_max can be set before the method breaks down. Unfortunately, although we are able to discuss a useful lower bound on the marginal exposure rates, exact bounds for conditional rates are hard to find.

Marginal Rates

For the marginal exposure rates, the following relation holds for any population of test takers:

  \sum_{i=1}^{N} P(A_i) = n,    (23)

where n is the length of the adaptive test. The relation was presented without formal proof in van der Linden (2003). However, a straightforward proof runs as follows: Let x_ij be an indicator variable equal to 1 if examinee j takes item i and equal to 0 otherwise. Then, for a population of J test takers,

  \sum_{j=1}^{J} x_{ij} / J

is the empirical marginal exposure rate of item i and

  \sum_{i=1}^{N} x_{ij} = n

is the common length of the test. Hence, the sum of the exposure rates can be written as

  \sum_{i=1}^{N} P(A_i) = \sum_{i=1}^{N} \mathcal{E}\left( \sum_{j=1}^{J} x_{ij} / J \right)
                        = \mathcal{E} \sum_{j=1}^{J} \left( \sum_{i=1}^{N} x_{ij} / J \right)
                        = n,    (24)

where \mathcal{E} denotes the expectation over replicated test administrations. The transition from the second to the third line in Equation 24 is valid because the test length is supposed to be the same for all j.

From a security point of view, it would be ideal if the exposure rates of all items were distributed uniformly with a common low value. From Equation 23, it follows that this type of distribution is only possible for

  P(A_i) = nN^{-1}, \quad \text{for all } i.    (25)
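The identity in Equation 23 and the uniform rate in Equation 25 are easy to check numerically. The simulation below uses arbitrary random selection of fixed-length tests rather than adaptive selection; the identity holds for any selection mechanism as long as every test taker receives exactly n items.

```python
import random

def empirical_rates(n_items, test_len, n_takers, seed=7):
    """Fraction of test takers who saw each item when every taker
    receives a fixed-length test of test_len distinct items."""
    rng = random.Random(seed)
    seen = [0] * n_items
    for _ in range(n_takers):
        for i in rng.sample(range(n_items), test_len):  # any selection rule
            seen[i] += 1
    return [s / n_takers for s in seen]

rates = empirical_rates(n_items=305, test_len=25, n_takers=500)
total = sum(rates)   # Equation 23: the rates always sum to n = 25
uniform = 25 / 305   # Equation 25: common rate if exposure were uniform
```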

To illustrate the capability of the current method to realize a uniform distribution of exposure rates, we simulated adaptive administrations of a test of 25 items from a pool of 305 items. The pool and the test are described in the section on the empirical study that follows. For this combination of test length and pool size, uniform exposure rates are only possible with P(A_i) = 25/305 = .0814 for all items. We simulated 10,000 administrations of the test for random test takers at θ = -2.0, -1.5, . . . , 2.0 for each of the goal values r_max = .25, .20, .15, .10, and .0814. (To make the results comparable, the range of θ values was chosen to be identical to that in the main study that follows.) The resulting exposure rates for the items are shown in Figure 2. The lower the goal value, the lower the maximum exposure rate and the larger the portion of the item pool that became active in the test. The curve for r_max = .0814 shows a uniform distribution except for minor disturbances at its extremes due to the probabilistic nature of the method. It is interesting to observe the difference between the curves for r_max = .10 and r_max = .0814. Although the difference between the two goal values seems negligible, for the former, some 50 items were still inactive, whereas all items became active for the latter.

Conditional Rates

For conditional exposure rates, the version of Equation 23 is more complicated. In this case, we have a different set of rates for each item pool at θ_k. In addition, the number of test takers who visit a pool is no longer fixed but random. Consequently, to derive an expression for the sum of the exposure rates, we have to weight the sums of the rates for the individual pools by the probability of a visit to the pool. Because the probability of a visit depends on how far the test taker is in the test, we arrive at the following version of Equation 23:

  \sum_{k=1}^{K} \sum_{g=1}^{n} \sum_{i=1}^{N} P_g(A_i \mid \theta_k)\, P_{gk} = n,    (26)

where g = 1, . . . , n still indexes the items in the adaptive test, P_gk is the probability of a test taker visiting the item pool at θ_k for the selection of item g, and P_g(A_i | θ_k) is the probability of administering item i during this visit. Because all probabilities in Equation 26 are dependent on each other, it is impossible to use this relation for deriving lower bounds on the conditional item-exposure rates. This conclusion holds a fortiori for adaptive testing with


FIGURE 2. Estimated marginal exposure rates of the items in the pool for the goal values r_max = .25, .20, .15, .10, and .0814. Note: For each curve, the items on the horizontal axis are ordered by their exposure rates.

item content constraints. We therefore took an empirical approach and explored what happened to the conditional exposure rates when the goal values r_max were systematically decreased in a series of simulated test administrations.

Empirical Study

An adaptive version of a section of the LSAT was simulated for 10,000 test takers at θ_k = -2.0, -1.5, . . . , 2.0. The test consisted of 25 discrete items, whereas the pool had a size of 305 items. The following versions of the test were simulated:

1. Tests with and without content constraints.
2. Tests with conditional item-exposure control at θ_k = -2.0, -1.5, . . . , 2.0 with the goal values r_max = 1 (no control), .20, .15, .10, .05, and .025.

The content constraints were the usual constraints for the paper-and-pencil version of the LSAT section; they dealt with such attributes as item type and content, answer-key distribution, word counts, and gender/minority orientation of the items. The constraints were imposed on the adaptive test using the shadow-test approach. The total number of constraints was equal to 30. The last two levels of exposure control were lower than the bound of .0814 calculated from Equation 25 for the marginal exposure control in the preceding section; they were therefore expected to be critical.

The test administrations were simulated with the maximum-information criterion for item selection in Equation 2. The initial ability estimate was set

The content constraints were the usual constraints for the paper-and-pencil version of the LSAT section; they dealt with such attributes as item type and content, answer key distribution, word counts, and gender/minority orientation of the items. The constraints were imposed on the adaptive test using the shadow-test approach. The total number of constraints was equal to 30. The last two levels of exposure control were lower than the bound of .0814 calculated from Equation 25 for the marginal exposure control in the preceding section; they were therefore expected to be critical. The tests administrations were simulated with the maximum-information criterion for item selection in Equation 2. The initial ability estimate was set 411


FIGURE 3. Estimated conditional exposure rates of the items given θ = −1.5, −0.5, 0.5, and 2.0 for adaptive testing without the content constraints. Note: Each curve is for a different goal value rmax = .25, .20, .15, .10, .05, and .025. For each curve, the items on the horizontal axis are ordered by their exposure rates. The maximum values for the curves decrease with rmax.

equal to θ^(0) = 0 for all test takers. The interim estimates of θ were calculated using the method of expected a posteriori (EAP) estimation with a noninformative prior. During the simulations, we recorded (a) the actual exposure rates and (b) the errors in the ability estimates at each θk.

The exposure rates for adaptive testing without the content constraints are shown in Figure 3. The four panels in this figure are for θk = −1.5, −0.5, 0.5, and 1.5. (The results at the other values of θk fitted the patterns in these panels exactly and are omitted for space.) Without exposure control, the maximum exposure rates tended to be close to 1.0. The largest rate was that of .99 observed


FIGURE 4. Estimated conditional exposure rates of the items given θ = −1.5, −0.5, 0.5, and 2.0 for adaptive testing with the content constraints. Note: Each curve is for a different goal value rmax = .25, .20, .15, .10, .05, and .025. For each curve, the items on the horizontal axis are ordered by their exposure rates. The maximum values for the curves decrease with rmax. The only exception is the maximum value for rmax = .025, which has moved to the second position for θ = −1.5 and θ = 1.5.

at θk = 0.5. For the conditions with control, the maximum rate decreased systematically with rmax. The lowest goal value of rmax = .025 had a maximum rate of .04 observed for a few items at each θk. The number of items that were still inactive for this goal value varied from 10 to 15 items at θk = −1.5 and 1.5 to 60 to 70 items at θk = −0.5 and 0.5. These results suggest that for the current item pool, rmax = .025 must have been close to the minimum value possible at the two outermost values of θk but that a lower value might have been possible at the θk values in the middle of the scale.
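The interim EAP estimates mentioned above are straightforward to sketch. The snippet below is a minimal illustration with hypothetical item parameters; the noninformative prior is approximated by a uniform distribution over a bounded grid.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3-PL response probability."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def eap(responses, a, b, c, grid=None):
    """EAP estimate of theta: posterior mean over a grid, flat prior."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 81)
    posterior = np.ones_like(grid)                   # uniform (noninformative) prior
    for u, ai, bi, ci in zip(responses, a, b, c):
        p = p_3pl(grid, ai, bi, ci)
        posterior *= p ** u * (1 - p) ** (1 - u)     # multiply in each item's likelihood
    posterior /= posterior.sum()
    return float(grid @ posterior)                   # posterior mean

# Toy example with three hypothetical items and response pattern (1, 0, 1).
a = np.array([1.2, 0.8, 1.5])
b = np.array([-0.5, 0.0, 0.5])
c = np.array([0.2, 0.2, 0.2])
print(round(eap([1, 0, 1], a, b, c), 3))
```

A grid of 81 points over [−4, 4] is a common quadrature choice; finer grids change the estimate only marginally.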

TABLE 1
Percentage of Feasible Shadow Tests During Adaptive Testing With Content Constraints for rmax = .025

θk           −2.0  −1.5  −1.0  −0.5     0   0.5   1.0   1.5   2.0
% Feasible    7.1  14.5   9.8  87.2  99.9  57.3  29.0  12.1  16.4

The exposure rates for adaptive testing with the content constraints in Figure 4 show the same general pattern. The only exceptions are the rates for rmax = .025 at θk = −1.5 and 1.5. The curves for this goal value jumped to the second position (that is, directly after the condition without control). The reason for this jump was that the goal value had become much too low to produce feasible shadow tests at each of the values θk. Table 1 shows the percentages of feasible shadow tests for rmax = .025 during the simulations. Although infeasibility was hardly a problem at θk = 0, for the more extreme values of θk the percentage of feasibility quickly decreased to 7.1 (θk = −2.0) and 16.4 (θk = 2.0). Recall that when infeasibility occurs, the items were replaced in the pool. For the higher values of rmax, replacement is an occasional random event, which is automatically corrected for by the adaptive mechanism in Equation 14. However, for rmax = .025 we appeared to have dived below what was possible at the majority of the θk values, and the method was no longer able to cope with the number of replacements. It is interesting to note that for the combination of rmax = .025 and adaptive testing without the content constraints in Figure 3, only 2 out of the 9 × 10,000 simulated administrations resulted in an infeasible shadow test. The reason for rmax = .025 being too low for adaptive testing with the content constraints was thus not that the number of eligible items in the pool was smaller than the length of the test but that none of the possible combinations of eligible items satisfied the set of content constraints for the test.

We also recorded the errors in the ability estimates during the test. Figures 5 and 6 show the estimated bias and mean square error (MSE) functions calculated from these errors. As expected, the curves were generally ordered in the values of rmax; for larger goal values, the curves were closer to the horizontal axis.
It should be observed that for rmax = .025, the errors were smaller for adaptive testing with the content constraints than without them. This finding may be surprising because the addition of content constraints to adaptive testing generally leaves less room for optimizing the item selection and consequently results in larger errors. However, the finding is explained by the fact discussed earlier that rmax = .025 was too low to satisfy the content constraints at the majority of the θk values. Because the algorithm frequently had to return all the items to the pool, the exposure rates of the more dominant items went up and the accuracy of the ability estimation improved.
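The self-adaptive mechanism referred to above can be illustrated with a simple multiplicative update of the eligibility probabilities. The exact form of Equation 14 is not reproduced in this excerpt, so the rule below is a plausible sketch rather than the authors' formula: items whose observed conditional exposure rate exceeds the goal value see their eligibility probability shrink, whereas under-exposed items drift back toward full eligibility.

```python
import numpy as np

def update_eligibility(p_elig, exposure_rate, r_max):
    """
    Multiplicative self-adaptive update of item-eligibility probabilities at one
    theta level (a sketch of the kind of rule in Equation 14, not the article's
    exact formula). Over-exposed items are made less likely to be eligible;
    items never administered are reset to full eligibility.
    """
    with np.errstate(divide="ignore", invalid="ignore"):
        new = np.where(exposure_rate > 0,
                       p_elig * r_max / exposure_rate,  # shrink or grow toward goal
                       1.0)                             # unexposed: fully eligible
    return np.clip(new, 0.0, 1.0)

p = np.array([1.0, 1.0, 0.6])          # current eligibility probabilities
rates = np.array([0.30, 0.02, 0.10])   # observed conditional exposure rates
print(update_eligibility(p, rates, r_max=0.10))
```

Under this rule the first item's probability drops to a third of its value, the under-exposed second item returns to 1.0, and the third item, already at its goal rate, is left unchanged.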

FIGURE 5. Estimated conditional bias and MSE functions for adaptive testing without content constraints for rmax = .25, .20, .15, .10, .05, and .025. Note: The curves in both panels are ordered in rmax. MSE = mean square error.

Practical Conclusion

For a high-stakes testing program, we expect item security to be the number one priority. On the other hand, the accuracy of the ability estimates cannot be sacrificed too much. The empirical study in this article was based on only one

FIGURE 6. Estimated conditional bias and MSE functions for adaptive testing with content constraints for rmax = .25, .20, .15, .10, .05, and .025. Note: The curves in both panels are ordered in rmax. MSE = mean square error.

combination of test and item pool. It is therefore dangerous to generalize. But if we had to choose a goal value for the conditional exposure rates for this combination, our choice would have been rmax = .15 or .10. The bias and MSE curves for the two lower levels of .05 and .025 in Figures 5 and 6 are much less favorable. Such levels may only become acceptable if the item pool is made larger and/or the test shorter (see the lower bound on the exposure rates in Equation 25).
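Equation 25 itself is not reproduced in this excerpt, but a back-of-the-envelope counting argument suggests why goal values near .08 are critical for this design: each administration exposes n items, so the average exposure rate over an N-item pool is n/N, and no uniform goal value below that average can be met by every item while the whole pool is in use.

```python
# Counting argument on exposure rates (a sanity check, not necessarily
# Equation 25's exact bound).
n, N = 25, 305              # test length and pool size in the empirical study
avg_rate = n / N            # average exposure rate over the pool
print(round(avg_rate, 3))   # ≈ .082
```

This simple average is of the same order as the .0814 bound the authors compute from Equation 25 and makes it plain why the goal values .05 and .025 could not be met by all items.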


Authors WIM J. VAN DER LINDEN is professor, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands; [email protected]. His areas of interest are test theory, applied statistics, and research methods.


BERNARD P. VELDKAMP is assistant professor, Department of Research Methodology, Measurement, and Data Analysis, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands; [email protected]. His areas of interest are educational measurement and statistics.

Manuscript received June 21, 2005
Accepted March 17, 2006

