Chapter 6: Overview of Test Assembly Methods in Multistage Testing

Yi Zheng, University of Illinois at Urbana-Champaign
Chun Wang, University of Minnesota
Michael J. Culbertson, University of Illinois at Urbana-Champaign
Hua-Hua Chang, University of Illinois at Urbana-Champaign

May 13, 2013

In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications. New York, NY: CRC Press.

In multistage testing (MST), examinees receive different sets of items from preassembled tests that are matched to provisional estimates of their ability levels. While it has many potential benefits, MST generates new challenges for test assembly because of the large number of possible paths through the test. A well-designed MST must (a) have distinct information curves between modules in each stage in order to sufficiently differentiate pathways through the test, (b) have sufficiently parallel forms for all pathways in parallel panels, and (c) meet all non-statistical design constraints (such as content balancing and enemy items) across many different possible pathways. These demands become especially heavy when the item bank is limited. Although automated test assembly (ATA) algorithms can relieve much of the burden on test developers, the algorithms must be adapted to the increased complexity of MST design. This chapter first discusses how the approach to MST assembly differs from assembling linear tests, followed by an overview of current ATA methods for MST. We then present a new paradigm for MST assembly called "assembly-on-the-fly," which borrows well-established item selection algorithms from computerized adaptive testing (CAT) to construct individualized modules for each examinee dynamically (also see Han & Guo, this volume, for related methods). Finally, we mention several possible directions for future development in MST assembly.

2.1 MST framework

A popular framework for MST is based on parallel panels, which constitute the primary assembly and administration unit. A panel is divided into several adaptive stages, each of which consists of one or more alternative modules (i.e., groups of items). Modules in the same stage are anchored at different difficulty levels. During the test, examinees are routed to the most suitable module in each stage based on their performance in the previous stages. The set of modules a given examinee receives is called a pathway (for more information on the MST framework, see Yan, Lewis, & von Davier, this volume).
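To make this vocabulary concrete, the sketch below (our illustration, not part of the chapter) encodes a small 1-3-3 panel in Python; the module labels, difficulty anchors, and item identifiers are invented for the example.

```python
# A minimal sketch of the MST vocabulary: a panel holds stages, each stage
# holds modules anchored at different difficulties, and a pathway is one
# module per stage. All labels and item ids are illustrative.
from dataclasses import dataclass
from itertools import product

@dataclass
class Module:
    label: str          # e.g., "2E" = stage 2, easy
    anchor: float       # difficulty anchor on the theta scale
    items: list         # item ids drawn from the bank

# A 1-3-3 panel: one routing module, then three modules per adaptive stage.
panel = [
    [Module("1M", 0.0, [101, 102, 103])],
    [Module("2E", -1.0, [201, 202]), Module("2M", 0.0, [203, 204]),
     Module("2H", 1.0, [205, 206])],
    [Module("3E", -1.0, [301, 302]), Module("3M", 0.0, [303, 304]),
     Module("3H", 1.0, [305, 306])],
]

# Every pathway through the panel (here 1 x 3 x 3 = 9; operational designs
# usually permit only adjacent transitions, which would prune this set).
pathways = list(product(*panel))
print([tuple(m.label for m in p) for p in pathways])
```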

As with linear tests, MST developers often wish to create multiple parallel panels for test security, efficient use of banked items, or repeated testing opportunities. In linear testing, items are assembled into forms, which are considered parallel if their test information functions or other alternative criteria are sufficiently similar (Samejima, 1977). In MST, a pathway is the equivalent of a linear test form, but different pathways within a panel are never parallel because they emphasize different difficulty levels. Rather, two MST panels are considered parallel if all of their corresponding pathways are parallel. Note that even when pathways are parallel, the individual modules need not be parallel. Thus, MST assembly consists of grouping items into modules and modules into panels optimally according to three goals: (a) to make the information curves of modules within a stage sufficiently distinct to provide adaptivity between stages, (b) to make the information curves of corresponding pathways across panels sufficiently similar to achieve parallel panels, and (c) to meet all non-statistical constraints for every pathway in each panel. Because of the large number of pathways (forms), meeting all three goals is highly demanding, especially when the item bank is limited.

2.2 MST assembly design

The MST framework accommodates a wide variety of panel design elements, including the number of stages, the number of alternative modules in each stage, and the difficulty anchor and number of items for each module. For parallel panels, the number of alternative forms of each module should also be determined, based on factors such as the desired exposure rates of each module (see, for instance, Breithaupt, Ariel, & Veldkamp, 2005, and Zheng et al., 2012). Additionally, statistical and non-statistical characteristics must be defined for the modules (and potentially the pathways). Statistical characteristics often take the form of a target test information function (TIF; see Section 2.4 for methods to determine target TIFs). All design parameters must take into account the supply of the item bank; in fact, the limitations of the item bank itself are often the most influential constraint in practical test assembly.

Once the panel design has been established, assembly of parallel MST panels usually proceeds in two steps: first, modules are assembled from items in the item bank; then, panels are assembled from the resulting modules.

There are two main strategies for achieving parallelism across panels (Luecht & Nungester, 1998): "bottom-up" and "top-down." In bottom-up assembly (e.g., Luecht, Brumfield, & Breithaupt, 2006), parallelism is assured by assembling parallel forms of each module. These parallel modules are then mixed and matched to build a large number of parallel panels. Since the alternative forms of each module are parallel, corresponding pathways in the resulting panels are automatically parallel. Generally, the bottom-up approach is easier to implement, when the item bank and constraints make it feasible.

In top-down assembly (e.g., Belov & Armstrong, 2005; Breithaupt & Hare, 2007; Zheng et al., 2012), modules are first assembled with or without respect to parallelism, and an additional round of optimization then takes place at the panel level to achieve parallelism and meet non-statistical constraints. This strategy may be useful for short tests in which constraints cannot be broken down evenly across modules and therefore can only be specified at the pathway level. In this case, constraints are applied unevenly to different modules to generate an initial set of modules, and alternative forms of each module are allowed to differ in test characteristics, as long as the final pathways are parallel and meet the requisite constraints (e.g., Zheng et al., 2012). Even when parallel modules could be constructed for the bottom-up strategy, the top-down approach permits greater control over panel properties, such as preventing enemy items across modules (e.g., Breithaupt & Hare, 2007).

2.3 Automated assembly for MST

Keeping track of all the MST assembly conditions quickly becomes arduous by hand. Fortunately, various automated test assembly (ATA) algorithms can be adapted for MST. By breaking assembly into two steps, ATA algorithms can first be applied to assemble individual modules from items and then applied again to assemble panels from modules.

2.3.1 Early test assembly methods

Originally, tests were designed and created on a small scale by hand, relying on little information from measurement theories. With the advent of modern measurement theories, practitioners began to utilize quantitative indices in analyzing and generating tests. The matched random subtests method (Gulliksen, 1950), one of the earliest test assembly methods, was based on two statistics from the classical test theory (CTT) perspective: all items are placed in a two-dimensional space formed by the CTT item difficulty and discrimination parameters; the items closest to each other in this space then form pairs, and the items in each pair are randomly assigned to two subtests.
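A minimal sketch of this pairing logic, under the assumptions of simulated CTT parameters and greedy nearest-neighbor matching, might look as follows; it illustrates the idea rather than Gulliksen's exact procedure.

```python
# Sketch of the matched random subtests idea: greedily pair the two closest
# items in the (difficulty, discrimination) plane, then split each pair
# between two subtests at random. Item parameters are simulated.
import numpy as np

rng = np.random.default_rng(1)
n_items = 20
items = np.column_stack([rng.uniform(0.2, 0.8, n_items),   # CTT difficulty (p-values)
                         rng.uniform(0.1, 0.6, n_items)])  # CTT discrimination

remaining = list(range(n_items))
subtest_a, subtest_b = [], []
while remaining:
    i = remaining.pop(0)
    # Find the remaining item closest to item i in the 2-D item space.
    d = [np.linalg.norm(items[i] - items[j]) for j in remaining]
    j = remaining.pop(int(np.argmin(d)))
    pair = [i, j]
    if rng.random() < 0.5:      # randomize which subtest gets which item
        pair.reverse()
    subtest_a.append(pair[0])
    subtest_b.append(pair[1])

print(sorted(subtest_a), sorted(subtest_b))
```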

With the advent of item response theory (IRT), the test information function (TIF) replaced item difficulty and discrimination as the primary statistic for controlling test assembly. The reciprocal of the TIF is a lower bound on the squared standard error of measurement, so controlling the test information curve controls the level of measurement error. Lord (1977) proposed an assembly procedure that sequentially selects items to fill the area underneath the target test information curve. At about the same time, Samejima (1977) proposed the concept of "weakly parallel tests," defined as "tests measuring the same ability or latent trait whose test information functions are identical" (p. 194). The principle of matching TIFs remains central to mainstream test assembly methods: generally, ATA algorithms specify a criterion function based on a target TIF (see Section 2.4) and optimize that criterion subject to a set of constraints.

2.3.2 The 0-1 programming methods

One of the main optimization approaches in ATA specifies the composition of forms as a point in a high-dimensional binary (0-1) space. Each axis of the space corresponds to an item, and the coordinates of the point indicate whether the given item is assigned to the form. Then, 0-1 programming methods, a subset of linear programming, are used to optimize an objective function over the binary space, subject to multiple constraints (for more details, see van der Linden & Guo and van der Linden & Diao, this volume, and van der Linden, 2005). Common objective functions include the test information function, the deviation of the information of the assembled test from the target, and the differences among multiple parallel test forms.

For example, the optimization problem may involve maximizing the test information subject to a fixed test length, an expected test time, content constraints, and enemy item specifications:

$$\text{Maximize} \quad \sum_{k=1}^{K}\sum_{i=1}^{I} I_i(\theta_k)\, x_i \qquad (2.1)$$

subject to

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I, \qquad (2.2)$$

$$\sum_{i=1}^{I} x_i = n, \quad \text{(total test length)} \qquad (2.3)$$

$$\sum_{i=1}^{I} t_i x_i \le T, \quad \text{(total expected time)} \qquad (2.4)$$

$$C_r^{(l)} \le \sum_{i \in V_r} x_i \le C_r^{(u)}, \quad r = 1, \ldots, R, \quad \text{(content bounds)} \qquad (2.5)$$

$$\sum_{i \in V_e} x_i \le 1, \quad e = 1, \ldots, E, \quad \text{(mutually exclusive items)} \qquad (2.6)$$

where $\theta_1, \ldots, \theta_K$ are $K$ representative monitoring locales on the ability scale, $x_i$ indicates whether item $i$ is included in the test, $I$ is the total number of items in the item bank, $I_i(\theta_k)$ is the information of item $i$ at $\theta_k$, $t_i$ is the expected response time for item $i$ and $T$ the limit on total expected test time, $V_r$ is the set of items belonging to content category $r$ with bounds $C_r^{(l)}$ and $C_r^{(u)}$, and $V_e$ is a set of enemy items.

Feuerman and Weiss (1973) and Yen (1983) first suggested using 0-1 programming for test assembly, but the first application of the method to ATA was by Theunissen (1985, 1986). Since then, the method has been enriched and is widely known and used. 0-1 programming techniques have been developed to simultaneously assemble multiple parallel test forms, and the algorithms can satisfy both absolute and relative targets (e.g., Boekkooi-Timminga, 1989; van der Linden, 2005). In addition to the TIF, the optimized objective function can be defined from various perspectives, such as CTT indices (Adema & van der Linden, 1989), test characteristic curves (Boekkooi-Timminga, 1990; van der Linden & Luecht, 1994), multidimensional IRT indices (Veldkamp, 2002), and Bayesian statistics (Veldkamp, 2010). Examples of other current advancements include a linearization approach to approximate the objective function (Veldkamp, 2002), the shadow test approach for CAT (van der Linden, 2010; Veldkamp, this volume), and the greedy shadow test approach (Veldkamp, 2010).
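For concreteness, the following sketch encodes the model in Equations 2.1-2.6 using the open-source PuLP package; the solver choice, the 2PL information function, and all item parameters and bounds are assumptions of this example, not part of the chapter.

```python
# A compact sketch of the 0-1 programming model in Equations 2.1-2.6.
import numpy as np
import pulp

rng = np.random.default_rng(0)
I, n = 60, 10                      # bank size, test length
a = rng.uniform(0.8, 2.0, I)       # 2PL discrimination
b = rng.normal(0.0, 1.0, I)        # 2PL difficulty
t = rng.uniform(1.0, 3.0, I)       # expected response time (minutes)
thetas = [-1.0, 0.0, 1.0]          # K representative ability points
content = rng.integers(0, 3, I)    # content category of each item

def info(a_i, b_i, th):            # 2PL Fisher information
    p = 1.0 / (1.0 + np.exp(-a_i * (th - b_i)))
    return a_i**2 * p * (1 - p)

prob = pulp.LpProblem("ata", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(I)]   # Eq. 2.2
prob += pulp.lpSum(float(info(a[i], b[i], th)) * x[i]
                   for i in range(I) for th in thetas)           # Eq. 2.1
prob += pulp.lpSum(x) == n                                       # Eq. 2.3
prob += pulp.lpSum(float(t[i]) * x[i] for i in range(I)) <= 25   # Eq. 2.4
for r in range(3):                                               # Eq. 2.5
    in_r = [x[i] for i in range(I) if content[i] == r]
    prob += pulp.lpSum(in_r) >= 2
    prob += pulp.lpSum(in_r) <= 5
prob += x[0] + x[1] <= 1            # Eq. 2.6: items 0 and 1 are "enemies"

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([i for i in range(I) if x[i].value() == 1])
```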

ATA via 0-1 programming searches for a single optimized solution for one or more parallel test forms that strictly satisfies all test assembly constraints. However, as the complexity of the constraints increases, the existing item bank may no longer suffice to meet all of the constraints, resulting in an over-constrained optimization problem for which no solution exists, termed infeasibility. The threat of infeasibility is particularly challenging when several parallel forms are required, since the number of constraints increases in proportion to the number of parallel forms (van der Linden, 2005). Timminga (1998), Huitzing (2004), and Huitzing, Veldkamp, and Verschoor (2005) discuss strategies for finding and circumventing the causes of infeasibility in test assembly.

In two-step MST assembly, 0-1 programming models are first specified to assemble individual modules. After the modules are assembled, new 0-1 programming models are specified, based on the assembled modules and the panel-level targets and constraints, to assemble the desired panels. In the recent MST literature, Ariel, Veldkamp, and Breithaupt (2006); Breithaupt, Ariel, and Veldkamp (2005); Breithaupt and Hare (2007); and Luecht, Brumfield, and Breithaupt (2006), among others, provide detailed descriptions of MST assembly using 0-1 programming methods.

2.3.3 Heuristic methods

Alternatives to ATA via 0-1 programming include heuristic methods. Unlike 0-1 programming methods, which attempt to assemble all test forms simultaneously in a single optimization procedure, heuristic ATA methods break test assembly down into a sequence of local optimization problems, each of which selects a single item to add to the test (Lord, 1977; Ackerman, 1989). The criterion function is usually based on a "central" criterion (such as the TIF), which is penalized by various "peripheral" constraints (such as content coverage). Because heuristic methods select items sequentially, they are "greedy": tests assembled earlier have access to more satisfactory items than those assembled later, which must select items from a diminished pool. Consequently, heuristic ATA methods must incorporate strategies (e.g., Ackerman, 1989) to offset this greediness and balance the quality of the assembled forms.

One strategy for balancing form quality is to iteratively select one item for each form instead of assembling entire forms at once.

The order in which the test forms receive items may be spiraled, randomized, or determined according to the extent of deviation of the current TIF (or another metric) from the target. Another strategy allows the initial assembly to proceed greedily, followed by a "swapping" step that exchanges items between forms to reduce between-form differences (Ackerman, 1989; Swanson & Stocking, 1993).

Heuristic ATA methods can incorporate non-statistical constraints in a number of ways. For example, the weighted deviation model (WDM; Swanson & Stocking, 1993) and the normalized weighted absolute deviation heuristic (NWADH; Luecht, 1998) treat all constraints as targets and form the criterion as the weighted sum of (normalized) deviations from the targets. The WDM (Swanson & Stocking, 1993) minimizes the weighted sum of deviations

$$\sum_{j=1}^{J} w_j d_j^{U} + \sum_{j=1}^{J} w_j d_j^{L}, \qquad (2.7)$$

where $d_j^{U}$ is the difference between the assembled test form and the upper bound of constraint $j$ when that upper bound is exceeded, $d_j^{L}$ is the difference from the lower bound when the lower bound is not met, and $w_j$ is the weight assigned to constraint $j$. For constraints on a continuous scale, such as information-based constraints, these deviations are simply the numeric differences for those constraints. For categorical constraints, such as content balancing, the deviations are computed from an index based on item membership. For example, suppose the test length is $n$ and there are already $k - 1$ items in the test; then, for candidate item $t$ in the available item bank, the index is computed as

$$\sum_{i=1}^{I} a_{ij} x_i + (n - k)\, v_j + a_{tj}, \qquad (2.8)$$

where $a_{ij} \in \{0, 1\}$ indicates whether item $i$ possesses property $j$, $x_i \in \{0, 1\}$ indicates whether item $i$ has been included in the test, and $v_j$ is the average occurrence of property $j$ in the currently available item bank. The first term in the index is the number of previously selected items relevant to the given constraint, the second term adjusts the index by the expected accumulation over the remainder of the test to make the index comparable with test-wide targets, and the last term reflects the relevance of candidate item $t$ to the given constraint. This quantity is then compared to both the upper and lower bounds to produce the expected deviations $d_j^{U}$ and $d_j^{L}$ in Equation 2.7.
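The bookkeeping in Equations 2.7-2.8 can be sketched as follows for a single categorical constraint; the bank, bounds, and weight are illustrative.

```python
# Sketch of the WDM index for one categorical constraint j: a test of
# length n with k-1 items already chosen, and lower/upper bounds on the
# number of items with property j.
import numpy as np

rng = np.random.default_rng(2)
I, n, k = 40, 12, 5                 # bank size, test length, next position
a_j = rng.integers(0, 2, I)         # a_ij: does item i have property j?
x = np.zeros(I, dtype=int)
x[:k-1] = 1                         # pretend items 0..k-2 are already in the test
lower, upper, w_j = 3, 5, 1.0       # bounds and weight for constraint j

pool = np.where(x == 0)[0]
v_j = a_j[pool].mean()              # average occurrence of property j in the pool

def weighted_deviation(t):
    # Equation 2.8: selected count + expected accumulation + candidate's term.
    idx = a_j[x == 1].sum() + (n - k) * v_j + a_j[t]
    d_upper = max(0.0, idx - upper)       # d_j^U: overshoot above the bound
    d_lower = max(0.0, lower - idx)       # d_j^L: shortfall below the bound
    return w_j * d_upper + w_j * d_lower  # one term of Equation 2.7

best = min(pool, key=weighted_deviation)
print("select item", best)
```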

The NWADH also uses weighted deviations from constraint targets, but it normalizes the deviations for each constraint so that they are on a common scale. Let $u_i$ denote the value of the relevant attribute of item $i = 1, \ldots, I$ (e.g., item information or membership in a content area), and let $T$ denote the corresponding target. To select the $k$th item into a test of length $n$, the locally normalized absolute deviation for every candidate item $t$ in the remaining pool is computed as

$$e_t = 1 - \frac{d_t}{\sum_{t \in R_{k-1}} d_t}, \quad t \in R_{k-1}, \qquad (2.9)$$

with

$$d_t = \left| \frac{T - \sum_{i=1}^{k-1} u_i}{n - k + 1} - u_t \right|, \quad t \in R_{k-1}, \qquad (2.10)$$

where $R_{k-1}$ is the set of items remaining in the item bank after $k - 1$ items have been selected into the test. The deviation $d_t$ is the absolute difference between the candidate item's contribution toward the target $T$ and the average contribution needed from each remaining item in order to achieve the target; the item with the smallest normalized absolute deviation (i.e., the largest $e_t$) is selected into the test.
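A compact sketch of Equations 2.9-2.10, assuming a single information target at one ability point, might look like this:

```python
# Sketch of NWADH selection: u_i is item i's attribute value (here, its
# information at one theta point) and T is the target for the full test.
import numpy as np

rng = np.random.default_rng(3)
I, n = 30, 10
u = rng.uniform(0.1, 0.8, I)        # u_i: item attribute values (simulated)
T = 5.0                             # target for the full test
selected = []

for k in range(1, n + 1):
    pool = [t for t in range(I) if t not in selected]
    need = (T - u[selected].sum()) / (n - k + 1)   # avg contribution still needed
    d = np.array([abs(need - u[t]) for t in pool]) # Equation 2.10
    e = 1.0 - d / d.sum()                          # Equation 2.9
    selected.append(pool[int(np.argmax(e))])       # max e_t = min deviation

print(selected, "total:", u[selected].sum())
```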

In the maximum priority index (MPI) method (Cheng & Chang, 2009; Cheng, Chang, Douglas, & Guo, 2009), originally proposed as an item selection method for constrained CAT, the central criterion (e.g., Fisher information) is multiplied by a factor computed from the number of remaining items permitted by each constraint. Denote the constraint relevancy matrix by $C$, a $J \times K$ matrix with $c_{jk} = 1$ indicating that constraint $k$ is relevant to item $j$ and $c_{jk} = 0$ otherwise. Each constraint is associated with a weight $w_k$. The priority index for item $j$ is computed as

$$PI_j = I_j \prod_{k=1}^{K} (w_k f_k)^{c_{jk}}, \qquad (2.11)$$

where $f_k$ measures the scaled "quota left" for constraint $k$. In a two-phase framework, items are first selected to satisfy all lower bounds (Phase I), with

$$f_k = \frac{l_k - x_k}{l_k}, \qquad (2.12)$$

where $l_k$ is the lower bound for constraint $k$ and $x_k$ is the number of previously selected items relevant to constraint $k$. After the lower bound of constraint $k$ has been reached (Phase II), the priority index shifts to ensure that the upper bound is not violated, with

$$f_k = \frac{u_k - x_k}{u_k}, \qquad (2.13)$$

where $u_k$ is the upper bound for constraint $k$. The method can also handle quantitative constraints, and it was later modified for a single-phase framework (Cheng, Chang, Douglas, & Guo, 2009).
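The two-phase MPI computation in Equations 2.11-2.13 can be sketched as follows; the item parameters, bounds, and weights are illustrative.

```python
# Sketch of two-phase MPI: Fisher information at a provisional theta, times
# a penalty per relevant constraint based on the scaled quota left.
import numpy as np

rng = np.random.default_rng(4)
J, K = 50, 3                         # items, constraints
info = rng.uniform(0.1, 1.0, J)      # I_j at the current theta estimate
c = rng.integers(0, 2, (J, K))       # c_jk: is constraint k relevant to item j?
w = np.ones(K)                       # constraint weights
lower, upper = np.array([2, 2, 2]), np.array([5, 5, 5])
x = np.zeros(K)                      # items selected so far per constraint

def f(k):
    if x[k] < lower[k]:                      # Phase I: chase the lower bound
        return (lower[k] - x[k]) / lower[k]  # Equation 2.12
    return (upper[k] - x[k]) / upper[k]      # Equation 2.13

def priority(j):
    # Equation 2.11: PI_j = I_j * prod_k (w_k f_k)^c_jk
    return info[j] * np.prod([(w[k] * f(k)) ** c[j, k] for k in range(K)])

j_star = int(np.argmax([priority(j) for j in range(J)]))
x += c[j_star]                        # update quotas after administering j_star
print("select item", j_star)
```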

As with the 0-1 programming approach, heuristic ATA algorithms can be applied to MST assembly using the two-step strategy: items are first assembled into modules, and the resulting modules are then assembled into panels. While in principle most heuristic ATA methods could be used for MST assembly, so far only the NWADH (Luecht, 1998) has been adapted to it (e.g., Luecht & Nungester, 1998; Zheng et al., 2012).

As in the assembly of linear tests, there is a trade-off between the satisfaction of multiple constraints and the complexity and feasibility of the algorithms. The 0-1 programming methods are not guaranteed to yield a solution, but any solution they do find strictly satisfies all constraints. The heuristic methods always produce a result and are less computationally intense, but they do not guarantee that all of the constraints will be met. In practice, some non-statistical attributes, such as content category, are often correlated with item difficulty, making it more difficult for every pathway to meet all of the specified constraints. In this case, it may be necessary to relax certain constraints for some pathways. The heuristic methods naturally provide this flexibility, meeting constraints where feasible and producing potentially suitable results when the constraints are infeasible.

2.3.4 Other ATA methods

Besides the 0-1 programming and heuristic approaches, a few other ATA approaches have been proposed. For example, Armstrong, Jones, and Wu (1992) proposed a two-step procedure that assembles parallel tests from a seed test using the transportation algorithm. Armstrong, Jones, and Kunce (1998) transformed the 0-1 programming problem into a network-flow programming problem. Belov and Armstrong (2005) proposed a Monte Carlo test assembly method in which items are sampled randomly from the item bank. Similarly, Chen, Chang, and Wu (2012) proposed two random sampling and classification procedures—the Cell Only method and the Cell and Cube method—to match the joint distribution of the difficulty and discrimination parameters of assembled test forms to that of a reference test. Among these ATA methods, Belov and Armstrong's (2005) Monte Carlo approach has been adapted to the MST context (Belov & Armstrong, 2008).

2.4 Setting difficulty anchors and information targets for modules

The greatest difference between MST assembly and linear test assembly lies in setting the difficulty anchors and TIF targets for each module. Linear tests generally require only a single TIF target; MST requires separate targets for each module in a given stage. Moreover, the difficulty anchors for the modules in a stage should be properly spaced to provide sufficiently distinct TIFs for valid routing (see Verschoor & Eggen, this volume, for more details on routing), and the TIF targets should be both optimized and reasonable with regard to the given item bank (Luecht & Burgin, 2003).

Controlling the TIF at every point along the ability scale is impossible. Instead, analysis focuses on a few discrete ability points, since the TIF is continuous and well behaved, and test developers are often interested primarily in certain critical ability levels (van der Linden & Boekkooi-Timminga, 1989). This latter point is especially applicable to MST, where the difficulty anchors of the modules are usually of greatest interest. When such special points exist, as in licensure or classification exams, the classification boundaries provide natural anchor points, and the TIF targets can be set to maximize the information at those boundaries.

Various approaches have been proposed to compute test information targets that are optimized and reasonable for a given item bank. For licensure or classification MST, a common approach is first to assemble several alternative forms of each module sequentially, greedily maximizing the TIF at the corresponding difficulty anchor. This creates a range of possible forms supported by the item bank, from most to least optimal. The final, reasonably optimized TIF targets are then taken as the average of the TIFs of the assembled alternative forms (Luecht, 2000; Luecht & Burgin, 2003; Breithaupt & Hare, 2007; Chen, 2011; Zheng et al., 2012).

For some ranking tests, where test scores are reported rather than classifications of examinees, TIF targets can be computed without setting difficulty anchors. First, linear tests (Belov & Armstrong, 2008; Jodoin et al., 2006) or CATs (Armstrong et al., 2004; Armstrong & Roussos, 2005; Patsula, 1999) of the appropriate length and constraints are assembled. Then, harder (or easier) linear tests or CATs are assembled for examinees with higher (or lower) abilities, and these are used to compute the TIF targets for the harder (or easier) modules.
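A sketch of the target-averaging recipe for a single module, under the assumptions of a simulated bank and a 2PL information function:

```python
# Sketch: greedily assemble several alternative forms that maximize
# information at a module's difficulty anchor, then average their TIFs
# to obtain a reasonably optimized target.
import numpy as np

rng = np.random.default_rng(5)
I, n_forms, n_items = 200, 4, 10
a = rng.uniform(0.8, 2.0, I)
b = rng.normal(0.0, 1.0, I)
anchor = 1.0                              # difficulty anchor for a "hard" module
grid = np.linspace(-3, 3, 61)             # theta grid for reporting the TIF

def item_info(i, th):
    p = 1.0 / (1.0 + np.exp(-a[i] * (th - b[i])))
    return a[i]**2 * p * (1 - p)

pool = list(range(I))
tifs = []
for _ in range(n_forms):
    # Greedy: each form takes the remaining items most informative at the
    # anchor, so successive forms are progressively less optimal.
    pool.sort(key=lambda i: -item_info(i, anchor))
    form, pool = pool[:n_items], pool[n_items:]
    tifs.append(sum(item_info(i, grid) for i in form))

target_tif = np.mean(tifs, axis=0)        # the reasonably optimized target
print("target information at the anchor:",
      target_tif[np.argmin(abs(grid - anchor))])
```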

2.5 "On-the-fly" MST assembly

In this section, we present a new MST assembly paradigm called "on-the-fly" assembly. To explain the rationale behind the on-the-fly design, we begin with the relationship between MST and CAT. According to Hendrickson (2007), MST was proposed as a "balanced compromise" between linear tests and CAT. On one hand, it provides adaptivity, retaining CAT's advantages of shorter tests and reduced examinee burden; on the other hand, like linear tests, it gives test developers the opportunity to review test forms before administration and allows examinees to skip questions and change answers within each stage (see Yan et al., this volume, for more details).

Despite these advantages, MST shares some of the limitations of fixed-form linear tests. First, with tests pre-assembled around a few difficulty anchors, MST may not provide satisfactory trait estimates for examinees at the two ends of the ability scale in ranking tests. Second, with items bundled together in groups, the test overlap rate may be high among examinees with similar abilities (Wang, Zheng, & Chang, under review): if test items are shared among friends or disclosed on the Internet, examinees of similar ability who happen to receive the same panel will likely take the same pathway and may be able to answer almost all of the compromised items correctly. Finally, facilitating continuous testing requires a large number of parallel panels, but constructing parallel panels in MST is much more demanding than assembling parallel linear tests, especially when the item bank is limited and multiple constraints must be satisfied.

2.5.1 The on-the-fly MST assembly paradigm

We can overcome some of the limitations of MST by borrowing a feature of CAT: instead of assembling each panel for generic examinees ahead of time, each stage of the test can be assembled dynamically for each examinee "on the fly," using well-developed CAT item selection algorithms (OMST; Zheng & Chang, 2011a, 2011b). In the first stage, since no information is yet available about examinees' ability levels, examinees receive a module randomly selected from several pre-assembled parallel forms that provide sufficient information across a wide range of the ability scale (e.g., the "flat" TIF of van der Linden & Boekkooi-Timminga, 1989). Before each subsequent stage, an individualized module is assembled for each examinee based on the examinee's provisional ability estimate (Figure 2.1), using an appropriate constrained CAT item selection method, such as the MPI method (Cheng & Chang, 2009), the WDM method (Swanson & Stocking, 1993), the NWADH method (Luecht, 1998), or the shadow test approach (van der Linden, 2010; van der Linden & Guo and van der Linden & Diao, this volume).

In CAT, each item is administered as soon as it is selected, and the ability estimate is updated after each item; in OMST, if heuristic methods are used, a group of items is sequentially selected based on the same ability estimate but administered together in each stage. When the shadow test approach is used in CAT, the most informative item from the assembled shadow test is administered; in OMST, the n most informative items from the shadow test can be administered, where n is the length of the upcoming stage.
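Putting these pieces together, a bare-bones OMST loop might look like the sketch below. It uses unconstrained maximum-information selection and a grid-search MLE purely for illustration; an operational system would apply a constrained method such as the MPI, WDM, NWADH, or shadow test approach, and stage 1 would use a pre-built flat module.

```python
# Sketch of an OMST loop: select a whole module at the provisional theta
# estimate, administer it, update theta, repeat for each stage.
import numpy as np

rng = np.random.default_rng(6)
I = 300
a, b = rng.uniform(0.8, 2.0, I), rng.normal(0.0, 1.0, I)
true_theta, stage_lengths = 0.7, [10, 10, 10]

def p_correct(i, th):
    return 1.0 / (1.0 + np.exp(-a[i] * (th - b[i])))

def mle(items, resp, grid=np.linspace(-4, 4, 161)):
    # Grid-search MLE of theta from the responses observed so far.
    loglik = [sum(np.log(p_correct(i, th) if r else 1 - p_correct(i, th))
                  for i, r in zip(items, resp)) for th in grid]
    return grid[int(np.argmax(loglik))]

theta_hat, used, responses = 0.0, [], []
for n in stage_lengths:
    pool = [i for i in range(I) if i not in used]
    # Assemble this examinee's next module: the n items most informative
    # at the provisional estimate (stage 1 would be a pre-built flat module).
    info = [a[i]**2 * p_correct(i, theta_hat) * (1 - p_correct(i, theta_hat))
            for i in pool]
    module = [pool[j] for j in np.argsort(info)[-n:]]
    used += module
    responses += [rng.random() < p_correct(i, true_theta) for i in module]
    theta_hat = mle(used, responses)
    print(f"stage done: theta_hat = {theta_hat:+.2f}")
```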

Figure 2.1. On-the-fly MST assembly.

Item bank usage can also be controlled by automated algorithms in OMST. On one hand, some items must be prevented from being over-exposed, lest they be disclosed and shared among examinees; the Sympson-Hetter method (Sympson & Hetter, 1985) has been the most widely adopted method for this purpose. On the other hand, to reduce the number of items that are never or rarely administered, stratifying the item bank by anticipated exposure rate is recommended: first, a complete simulation of the Sympson-Hetter-controlled OMST without item bank stratification is carried out. Then, the items in the bank are partitioned into two sub-banks according to their exposure rates: the items with the lowest exposure rates are assigned to the under-used bank and administered in the first stage, while the remaining items are assigned to the well-used bank and administered in the subsequent stages. Since test developers have greater control over the items selected in the static first stage than in the dynamically assembled subsequent stages, deliberately placing otherwise under-exposed items in the first stage improves their usage, as sketched below.
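The partition step can be sketched as follows; here the simulated exposure rates are random stand-ins for the output of the full Sympson-Hetter OMST simulation, and the stage-1 share of the bank is an assumed design choice.

```python
# Sketch of exposure-based bank stratification: route the least-exposed
# items to the static first stage.
import numpy as np

rng = np.random.default_rng(7)
I = 300
exposure = rng.beta(0.5, 4.0, I)        # stand-in for simulated exposure rates
first_stage_share = 0.3                  # assumed share of the bank for stage 1

cut = np.quantile(exposure, first_stage_share)
under_used_bank = np.where(exposure <= cut)[0]   # used by the static stage 1
well_used_bank = np.where(exposure > cut)[0]     # used by the adaptive stages
print(len(under_used_bank), "items reserved for stage 1")
```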

The item bank stratification procedure is based on the rationale of Chang and Ying's (1999) a-stratification method. The under-used bank tends to have lower a-parameters than the complete item bank, since low-a items are generally selected less often. According to Chang and Ying (1999), low-a items are a better choice for the beginning stage: when we have no knowledge of the examinee at all, low-a items tend to provide greater global information than high-a items. In other words, at the initial stage we need low-a items to shed light on a wide range of possible θ values. This also saves the high-a items for later stages, when we have a rough estimate of θ and need high-a items to provide greater discriminating power in the neighborhood of the estimated θ location.

OMST maintains the multistage structure of the classical MST design, but examinees receive more "individualized" tests. Within each stage, the number of alternative modules is much larger than in the classical MST design, potentially as large as the examinee sample size. Because OMST provides more flexibility to adapt to examinees' ability estimates than is possible with pre-assembled panels, it is particularly advantageous for estimating examinees at the ends of the ability scale. OMST also frees test developers from the burdensome requirement of developing many parallel panels. Moreover, without fixed forms, OMST reduces the probability that the test overlap rate is extremely high among some examinees, enhancing test security (Wang, Zheng, & Chang, under review). Meanwhile, because OMST selects items only between stages, examinees may still navigate freely within each stage to review and revise their answers.

As in CAT, OMST requires test developers to give up the opportunity to review completed test forms before administration in order to achieve these benefits. However, when a great number of test forms is needed, human review may prove too time-consuming and expensive in any case, and OMST may be more practical, as long as the selected ATA algorithms provide satisfactory quality control.

2.5.2 Future research in on-the-fly test assembly

The basic OMST framework suggests a number of potential research paths for further improving test performance. For example, since there is no inherent requirement that each stage have the same number of items, how should the length of each stage be tuned to yield the best performance? At the beginning of the test, when little information about examinee ability has been gathered, longer stages may be needed to provide accurate estimates before selecting items for the next stage (Chang & Ying, 1996). In later stages, when the estimate is close to its true value, shorter stages can provide more highly tailored test information, similar to the CAT design. As the stage length decreases, the test transitions smoothly from MST to CAT.

Similarly, OMST could adjust the width of the target ability window adaptively during the test. Since early ability estimates carry substantial measurement error, instead of maximizing test information at the single point given by the ability estimate, individualized modules could maximize test information within an interval around the provisional ability estimate that shrinks toward zero as the test progresses. This integration interval reduces the likelihood of selecting uninformative items when the ability estimate is far from the true ability (Chang & Ying, 1996).
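One possible reading of this idea, with an assumed schedule of shrinking window half-widths, is sketched below; the quadrature is a simple average-times-width stand-in for the integral.

```python
# Sketch of interval-based selection: rank items by information integrated
# over [theta_hat - delta, theta_hat + delta] with delta shrinking by stage.
import numpy as np

rng = np.random.default_rng(8)
I = 200
a, b = rng.uniform(0.8, 2.0, I), rng.normal(0.0, 1.0, I)

def interval_info(i, theta_hat, delta, n_pts=21):
    # Average information over the window times its width: a simple
    # stand-in for integrating I_i(theta) over the interval.
    grid = np.linspace(theta_hat - delta, theta_hat + delta, n_pts)
    p = 1.0 / (1.0 + np.exp(-a[i] * (grid - b[i])))
    return (a[i]**2 * p * (1 - p)).mean() * (2 * delta)

theta_hat = 0.4
for stage, delta in enumerate([1.0, 0.5, 0.1], start=1):  # shrinking window
    ranked = sorted(range(I), key=lambda i: -interval_info(i, theta_hat, delta))
    print(f"stage {stage}: best item {ranked[0]} with delta={delta}")
```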

 


2.5.3 MST, CAT, and other designs—which way to go?

No single design—whether CAT, MST, or another—can adequately serve all testing programs universally. The appropriateness of different test designs must be evaluated case by case (also see Yan et al., this volume). Some tests are composed of natural item groups, such as items that share the same passage, while others are not. Some tests serve licensure or classification purposes, while others are intended for ranking. In low-stakes diagnostic scenarios, such as patient-reported outcome (PRO) assessment in medical practice and brief in-class educational assessments, reducing test length is a priority. Test design decisions should be made according to the specific needs determined by test use. Moreover, the available item bank also plays a significant role in these decisions: when the supply of the item bank is limited and the assembly constraints are relatively complex, the psychometric advantages that differentiate the various adaptive designs may diminish.

As MST becomes more prevalent in operational testing programs, the number and variety of available designs will certainly grow to match the diversity of measurement scenarios. Given the complexity of MST panel design, MST assembly will need flexible paradigms that adapt to ever-evolving testing demands. On-the-fly MST extends the MST framework with the flexibility of CAT, opening a new avenue for more flexible hybrid adaptive test designs to meet new measurement challenges as they arise.

References

Ackerman, T. A. (1989, March). An alternative methodology for creating parallel test forms using the IRT information function. Paper presented at the Annual Meeting of the National Council on Measurement in Education, San Francisco, CA.

Adema, J. J., & van der Linden, W. J. (1989). Algorithms for computerized test construction using classical item parameters. Journal of Educational Statistics, 14(3), 279-290.

Ariel, A., Veldkamp, B. P., & Breithaupt, K. (2006). Optimal testlet pool assembly for multistage testing designs. Applied Psychological Measurement, 30(3), 204-215.

Armstrong, R. D., Jones, D. H., Koppel, N. B., & Pashley, P. J. (2004). Computerized adaptive testing with multiple-form structures. Applied Psychological Measurement, 28(3), 147-164.

Armstrong, R. D., Jones, D. H., & Kunce, C. S. (1998). IRT test assembly using network-flow programming. Applied Psychological Measurement, 22(3), 237-247.

Armstrong, R. D., Jones, D. H., & Wu, I. (1992). An automated test development of parallel tests from a seed test. Psychometrika, 57(2), 271-288.

Armstrong, R. D., & Roussos, L. (2005). A method to determine targets for multi-stage adaptive tests (Report No. 02-07). Newton, PA: Law School Admission Council.

Belov, D. I., & Armstrong, R. D. (2005). Monte Carlo test assembly for item pool analysis and extension. Applied Psychological Measurement, 29(4), 239-261.

Belov, D. I., & Armstrong, R. D. (2008). A Monte Carlo approach to the design, assembly, and evaluation of multistage adaptive tests. Applied Psychological Measurement, 32(2), 119-137.

Berger, J. O. (2008). Sequential analysis. In S. N. Durlauf & L. E. Blume (Eds.), The new Palgrave dictionary of economics (2nd ed.). Houndmills, Basingstoke, Hampshire, UK: Palgrave Macmillan.

Boekkooi-Timminga, E. (1990). The construction of parallel tests from IRT-based item banks. Journal of Educational Statistics, 15(2), 129-145.

Breithaupt, K., Ariel, A., & Veldkamp, B. P. (2005). Automated simultaneous assembly for multistage testing. International Journal of Testing, 5(3), 319-330.

Breithaupt, K., & Hare, D. R. (2007). Automated simultaneous assembly of multistage testlets for a high-stakes licensing examination. Educational and Psychological Measurement, 67(1), 5-20.

Chang, H.-H., Qian, J., & Ying, Z. (2001). a-Stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 25(4), 333-341.

Chang, H.-H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213-229.

Chang, H.-H., & Ying, Z. (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23(3), 211-222.

Chen, L.-Y. (2011). An investigation of the optimal test design for multi-stage test using the generalized partial credit model (Doctoral dissertation). The University of Texas at Austin, Austin, TX.

Chen, P. H., Chang, H. H., & Wu, H. (2012). Item selection for the development of parallel forms from an IRT-based seed test using a sampling and classification approach. Educational and Psychological Measurement, 72(6), 933-953.

Cheng, Y., & Chang, H.-H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62, 369-383.

Cheng, Y., Chang, H.-H., Douglas, J., & Guo, F. (2009). Constraint-weighted a-stratification for computerized adaptive testing with nonstatistical constraints. Educational and Psychological Measurement, 69(1), 35-49.

Feuerman, M., & Weiss, H. (1973). A mathematical programming model for test construction and scoring. Management Science, 19(8), 961-966.

Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley. (Reprinted 1987, Hillsdale, NJ: Erlbaum.)

Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement: Issues and Practice, 26, 44-52.

Huitzing, H. A. (2004). Using set covering with item sampling to analyze the infeasibility of linear programming test assembly models. Applied Psychological Measurement, 28(5), 355-375.

Huitzing, H. A., Veldkamp, B. P., & Verschoor, A. J. (2005). Infeasibility in automated test assembly models: A comparison study of different methods. Journal of Educational Measurement, 42(3), 223-243.

Jodoin, M. G., Zenisky, A., & Hambleton, R. K. (2006). Comparison of the psychometric properties of several computer-based test designs for credentialing exams with multiple purposes. Applied Measurement in Education, 19(3), 203-220.

Lord, F. M. (1971). Tailored testing, an application of stochastic approximation. Journal of the American Statistical Association, 66(336), 707-711.

Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14(2), 117-138.

Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied Psychological Measurement, 22, 224-236.

Luecht, R. M. (2000, April). Implementing the computer-adaptive sequential testing (CAST) framework to mass produce high quality computer-adaptive and mastery tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA.

Luecht, R. M., Brumfield, T., & Breithaupt, K. (2006). A testlet assembly design for adaptive multistage tests. Applied Measurement in Education, 19(3), 189-202.

Luecht, R. M., & Burgin, W. (2003, April). Test information targeting strategies for adaptive multistage testing designs. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL.

Luecht, R. M., & Nungester, R. J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35(3), 229-249.

Patsula, L. N. (1999). A comparison of computerized adaptive testing and multistage testing (Doctoral dissertation). University of Massachusetts, Amherst, MA.

Samejima, F. (1977). Weakly parallel tests in latent trait theory with some criticisms of classical test theory. Psychometrika, 42(2), 193-198.

Swanson, L., & Stocking, M. L. (1993). A model and heuristic for solving very large item selection problems. Applied Psychological Measurement, 17(2), 151-166.

Sympson, J. B., & Hetter, R. D. (1985, October). Controlling exposure rates in computerized adaptive testing. Paper presented at the Annual Meeting of the Military Testing Association, San Diego, CA.

Theunissen, T. J. J. M. (1985). Binary programming and test design. Psychometrika, 50(4), 411-420.

Theunissen, T. J. J. M. (1986). Some applications of optimization algorithms in test design and adaptive testing. Applied Psychological Measurement, 10(4), 381-389.

Timminga, E. (1998). Solving infeasibility problems in computerized test assembly. Applied Psychological Measurement, 22(3), 280-291.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22(3), 195-211.

van der Linden, W. J. (2005). Linear models of optimal test design. New York, NY: Springer.

van der Linden, W. J. (2010). Constrained adaptive testing with shadow tests. In W. J. van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (pp. 31-55). New York, NY: Springer.

van der Linden, W. J., & Boekkooi-Timminga, E. (1989). A maximin model for test design with practical constraints. Psychometrika, 54(2), 237-247.

van der Linden, W. J., & Luecht, R. M. (1994). An optimization model for test assembly to match observed-score distributions (Report No. 94-7). Enschede, the Netherlands: University of Twente, Faculty of Educational Science and Technology.

Veldkamp, B. P. (2002). Multidimensional constrained test assembly. Applied Psychological Measurement, 26, 133-146.

Veldkamp, B. P. (2010). Bayesian item selection in constrained adaptive testing. Psicológica, 31, 149-169.

Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2), 117-186.

Wang, C., Zheng, Y., & Chang, H. (under review). Does variance matter? A new "variance" index for quantifying security of on-line testing. Psychometrika.

Yen, W. M. (1983). Use of the three-parameter model in the development of a standardized achievement test. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 123-141). Vancouver, BC: Educational Research Institute of British Columbia.

Zheng, Y., & Chang, H.-H. (2011a, October). Automatic on-the-fly assembly for computerized adaptive multistage testing. Paper presented at the International Association for Computerized Adaptive Testing Conference, Pacific Grove, CA.

Zheng, Y., & Chang, H.-H. (2011b, April). Automatic on-the-fly assembly for computer adaptive multistage testing. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA.

Zheng, Y., Nozawa, Y., Gao, X., & Chang, H.-H. (2012). Multistage adaptive testing for a large-scale classification test: The designs, automated heuristic assembly, and comparison with other testing modes (ACT Research Report No. 2012-6). Retrieved from http://media.act.org/documents/ACT_RR2012-6.pdf