Tao, Y.-H., Wu, Y.-L., & Chang, H.-Y. (2008). A Practical Computer Adaptive Testing Model for Small-Scale Scenarios. Educational Technology & Society, 11(3), 259–274.
A Practical Computer Adaptive Testing Model for Small-Scale Scenarios

Yu-Hui Tao // Department of Information Management, National University of Kaohsiung, Taiwan, R.O.C. // [email protected] // Tel: +886-7-5919220 // Fax: +886-7-5919328
Yu-Lung Wu // Department of Information Management, I-Shou University // [email protected]
Hsin-Yi Chang // Corel Corporation, Taiwan, R.O.C. // [email protected]
ABSTRACT

Computer adaptive testing (CAT) is theoretically sound and efficient, and is commonly seen in larger testing programs. It is, however, rarely seen in smaller-scale scenarios, such as classrooms or routine business assessments, because of the complexity of the most widely adopted Item Response Theory (IRT) models. While the Sequential Probability Ratio Test (SPRT) model is less complicated, it only provides the examinee's mastery result. We therefore propose an SPRT-based adaptive testing approach that is simpler to implement while still being able to approximate IRT scores at semantic or rank levels for flexible assessment needs. An English adaptive testing prototype is implemented and benchmarked against Test of English as a Foreign Language (TOEFL) testing results. Overall, this research empirically demonstrates the validity of the proposed SPRT-based testing approach as well as the technical feasibility for teachers and business organizations to take real advantage of CAT in their daily routines.
Keywords: Computer adaptive testing, Item response theory, Sequential probability ratio test, TOEFL
Introduction

Traditionally, testing for evaluating knowledge, skills, abilities, and other characteristics (KSAOs) has been done in a paper-and-pencil scenario. However, the development of information technology (IT) over the last two decades has made computer-based testing (CBT) feasible in both educational research and practice (Bunderson et al., 1989). Furthermore, today's e-learning technology enables organizations to adopt online instruction as well as online testing. These evolving technologies have moved traditional pencil-and-paper testing toward computer-based, or even computer adaptive testing (CAT), scenarios. In theory, CAT can dramatically reduce testing time while maintaining the quality of measurement compared to fixed-item tests in either pencil-and-paper or CBT format (Wise & Kingsbury, 2000). It has therefore been researched and applied extensively in larger educational institutes and certification or licensure centers (Olson, 1990; ETS, 2001; Taiwan Education Testing Center, 2007). However, CAT is used neither by classroom teachers who make up and administer their own tests (Frick, 1992) nor by business organizations in their daily KSAO routines.

One major cause of this situation is that the most widely adopted CAT model, Item Response Theory (IRT), is too rigorous to implement and maintain. Wise and Kingsbury (2000) listed item pools, test administration, test security, and examinee issues as the four general areas of practical concern in developing and maintaining IRT-based CAT programs. In particular, the adoption issues mostly fall into the item-pool area, which includes pool size and control, dimensionality of an item pool, response models, item removal and revision, addition of items to the item pool, maintenance of scale consistency, and the use of multiple item pools. Since rigorous IRT requires a large number of examinees, ranging from 200 to 1000, for estimating item parameters, as well as special expertise in item-pool maintenance, IRT is only practical in educational institutes or professional testing centers (Frick, 1992).

The Sequential Probability Ratio Test (SPRT) model is another CAT model; it is less widely adopted because it only provides the examinee's mastery result and lacks the assessment flexibility of an IRT score. Nevertheless, the original SPRT waives the maintenance requirement of a pre-test involving a large number of examinees (Frick, 1990). The tradeoff of this characteristic is that SPRT does not account for variability in item difficulty, discrimination, or the chance of guessing. An empirical study by Frick (1990) indicated that SPRT is a fairly robust model for mastery
[email protected].
259
decisions, especially under smaller Type I and Type II decision error rates such as 0.025. Moreover, although a parameter-estimation pre-test and calibration of the item pool may be preferred, IRT still suffers from accuracy and validity issues (Frick, 1990; Huff & Sireci, 2001; Welch & Frick, 1993).

From the above perspectives, SPRT seems to be a practical alternative for CAT applications by school teachers and business organizations. We propose an SPRT-based CAT approach that inherits the maintenance-free item pool that is SPRT's strength and approximates the grade classification that is the spirit of IRT. In addition, to show the validity of the proposed approach, the criterion-validity method (Zikmund, 1997) is adopted by comparing an English CAT prototype system based on the proposed approach against the Test of English as a Foreign Language (TOEFL) standard. Criterion validity was chosen because the potential source of construct-irrelevant variance originating from an examinee's unfamiliarity with computers has been studied and found to be negligible (Taylor et al., 1999). Technically speaking, criterion validity answers questions like "Does my measure correlate with other measures of the same construct?" (Zikmund, 1997). In other words, if the proposed SPRT-based CAT approach can distinguish the English abilities of examinees just as TOEFL does, it can be claimed to establish criterion validity against TOEFL. Accordingly, our benchmark test against TOEFL can serve as a leading indicator of the empirical validity of the proposed SPRT-based approach.
Background

CAT differs from traditional pencil-and-paper or regular CBT testing in that the evaluation is done with the minimum possible number of questions, adapted to the ability of the examinee (Welch & Frick, 1993). Since IRT and SPRT are the primary models in adaptive testing, and TOEFL is the benchmarking standard in our experiment, they are briefly introduced below.

IRT

IRT originated with Lord's work in the early 1950s (Lord, 1980) and uses probability to explain the relationship between the examinee's ability and the item response. Specifically, a mathematical model called the Item Characteristic Function (ICF), which derives a continuous increasing curve relating ability to test performance, was developed to infer the examinee's ability or underlying trait. The ICF can be classified into different variations based on the number of parameters adopted within the mathematical model. The three often-used models are the single-parameter, two-parameter, and three-parameter models, which are well summarized by Yu (1992) with the original references:
Single-parameter model: $P_i(\theta) = \dfrac{e^{(\theta - b_i)}}{1 + e^{(\theta - b_i)}}$,  (1)

Two-parameter model: $P_i(\theta) = \dfrac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}$,  (2)

Three-parameter model: $P_i(\theta) = c_i + (1 - c_i)\,\dfrac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}$,  (3)

where
D = 1.702 (scaling constant)
e: base of the natural logarithm, 2.71828
i: item number, i = 1, 2, 3, …, N, where N is the total number of items
θ: examinee's ability
a_i: discrimination parameter of item i
b_i: difficulty parameter of item i
c_i: guessing parameter of item i
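To make formulas (1)-(3) concrete, a minimal Python sketch follows (illustrative only, not part of the original study; the item parameter values in the usage example are hypothetical). The one- and two-parameter models fall out as special cases of the three-parameter ICF.

```python
import math

def icf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter Item Characteristic Function, formula (3).

    theta: examinee's ability
    a: discrimination parameter of the item
    b: difficulty parameter of the item
    c: guessing parameter of the item
    Setting c = 0 gives the two-parameter model (2); setting c = 0 and a = 1
    gives the single-parameter model (1).
    Note: the scaling constant D = 1.702 listed with the formulas is omitted
    here to match equations (1)-(3) as written; including it would scale the
    exponent to D * a * (theta - b).
    """
    logistic = math.exp(a * (theta - b)) / (1.0 + math.exp(a * (theta - b)))
    return c + (1.0 - c) * logistic

# Hypothetical item: moderately discriminating, slightly hard, 20% chance of guessing
print(icf_3pl(theta=1.0, a=1.2, b=0.5, c=0.2))  # probability of a correct response
```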
IRT has a few basic assumptions, including unidimensionality, local independence, a non-speeded test, and the know-correct assumption, which need to hold before the model can be used to analyze the data (Ackerman, 1989). In IRT, an item is treated as the basic unit for measuring the examinee's ability through the parameters. The execution follows a simple principle: if an examinee responds correctly to an item, the next item will be one level higher in difficulty, and vice versa. After each item response, the examinee's estimated ability is re-evaluated and an item of the appropriate level is administered next. The process is repeated until a pre-set reliability level or stopping rule is reached (Huff & Sireci, 2001). IRT has been widely used in CAT studies, such as those focusing on ability testing (Si & Schumacker, 2004), CAT systems (Chou, 2000; Ho & Yen, 2005; Guzmán & Conejo, 2005), and e-learning (Yang et al., 2005; Hatzilygeroudis et al., 2006; Wang, 2006).

SPRT

Wald (1947) originally developed SPRT, and Ferguson (1967) later adopted it in educational testing for the pass-or-fail decision. Although SPRT is easier to implement and was therefore adopted by Wang & Chuang (2002) for school teachers, it is not commonly used in education (Welch & Frick, 1993); it is, however, used in other domains such as engineering (Thanh Ngoc Bui et al., 2004; Das et al., 2005) and science (Tartakovsky et al., 2003; Jarman et al., 2004). In the standard SPRT execution, items are randomly selected and the sequential probability ratio is calculated from the item responses. Related definitions from Frick (1990) are as follows:

LBM (Lower Bound Mastery): $LBM = \dfrac{1-\beta}{\alpha}$,  (4)

UBN (Upper Bound Non-mastery): $UBN = \dfrac{\beta}{1-\alpha}$,  (5)

Probability ratio of the item responses: $PR = \dfrac{P_m^{\,s}\,(1-P_m)^{f}}{P_{nm}^{\,s}\,(1-P_{nm})^{f}}$,  (6)
where
P_m: probability of a correct item response given mastery
P_nm: probability of a correct item response given non-mastery
s: number of correct responses out of the items administered so far
f: number of incorrect responses out of the items administered so far
α: Type I error, judging mastery when the examinee is in fact a non-master
β: Type II error, judging non-mastery when the examinee is in fact a master

When an examinee finishes an item and the system-calculated PR is greater than or equal to LBM, the examinee is judged to be "mastery" and the test is terminated. If the result is undetermined, with UBN < PR < LBM, the test continues with a new randomly selected item. Otherwise, if PR ≤ UBN, the examinee is judged to be "non-mastery" (Frick, 1989). Although SPRT does not involve complicated mathematical formulas, it still rests on two basic assumptions: a test item is randomly selected from the item bank and cannot be repeated, and local independence holds, as it does in IRT.

Two requirements may prevent an instructor from developing an IRT item pool. First, Frick (1992) reported that IRT requires 200 to 1000 examinees to calibrate the item parameters, which consumes more time and human resources than SPRT in preparation. Second, IRT involves complex mathematical formulas. Compared to the complexity of IRT execution, SPRT is a simpler model to implement. Its potential drawback, however, is that it does not consider item difficulty, discrimination, or the degree of guessing. Therefore, Frick combined the two theories and developed an expert-system-based SPRT (EXSPRT). The application of EXSPRT is similar to SPRT, but the former weights the items in the item pool differently; as a result, EXSPRT requires only 50 examinees to calibrate the item bank (Frick, 1992). A comparison among IRT, SPRT, and EXSPRT is summarized in Table 1.
Table 1. Comparison between adaptive models

Adaptive model | Examinee size for parameter estimation | Item selection                                  | Difficulty of implementation
IRT            | 200-1000 people                        | Item response                                   | Difficult
SPRT           | None                                   | Random                                          | Easy
EXSPRT         | 50 people                              | Random in EXSPRT-R or item response in EXSPRT-I | More difficult than SPRT
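As an illustration of the decision rule built on formulas (4)-(6), here is a minimal Python sketch (not from the paper; the mastery and non-mastery response probabilities P_m and P_nm are hypothetical values an instructor would set). It classifies the examinee as mastery, non-mastery, or undetermined after each response.

```python
def sprt_decision(s, f, p_m=0.8, p_nm=0.5, alpha=0.025, beta=0.025):
    """Apply the SPRT decision rule after s correct and f incorrect responses.

    p_m: probability that a master answers an item correctly (hypothetical)
    p_nm: probability that a non-master answers an item correctly (hypothetical)
    alpha, beta: Type I and Type II error rates (0.025 as recommended by Frick, 1990)
    Returns 'mastery', 'non-mastery', or 'undetermined'.
    """
    lbm = (1 - beta) / alpha                # formula (4): lower bound for mastery
    ubn = beta / (1 - alpha)                # formula (5): upper bound for non-mastery
    pr = (p_m ** s * (1 - p_m) ** f) / (p_nm ** s * (1 - p_nm) ** f)  # formula (6)

    if pr >= lbm:
        return "mastery"
    if pr <= ubn:
        return "non-mastery"
    return "undetermined"                   # UBN < PR < LBM: present another item

# Example: 8 correct and 1 incorrect responses so far
print(sprt_decision(s=8, f=1))
```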
We would like to emphasize that, even when SPRT has been used inappropriately, for example with items varying widely in difficulty and/or discriminating power, or when a mastery or non-mastery decision is reached before a representative sample of items has been administered to an examinee, the SPRT decisions are still highly accurate as long as the prior error rates (α and β) are kept at their minimum, as low as 0.025 (Frick, 1990).

TOEFL

TOEFL is an English proficiency test provided by the Educational Testing Service (ETS) for non-native English speakers; its scores serve as proof of English proficiency when applying to universities or graduate schools in Canada and the United States (ETS, 2007). TOEFL in Taiwan has been administered as a CAT since October 2000, with each examinee equipped with a computer and headphones at the test site. The TOEFL CAT allows the examinee to see only one item at a time on the screen. When the examinee presses the enter key, the computer evaluates the response and selects the next item of appropriate difficulty. The examinee can neither skip any item nor go back to previous items.
The proposed SPRT-based CAT approach

The main objective of this research is to propose a feasible way for CAT to be used in typical classrooms or business organizations. The idea of the proposed approach is that it is SPRT-based while adopting the IRT spirit of a graded result rather than a single mastery/non-mastery outcome. As a base, SPRT contributes its simplicity: an item pool that consists only of items of equal difficulty. The IRT spirit of a graded result is achieved by expanding the SPRT item pool into an N-tier item pool (a minimal illustration of such a pool follows this paragraph). In other words, each tier of the item pool contains items of only one difficulty level, so that an examinee's ability can be identified as one of up to N levels. The rationale behind this design is that the domain expert or teacher in charge can easily maintain the N-tier item pool based only on his or her professional expertise; no additional expertise in testing theory or in a parameter-estimation process is required, as it is in EXSPRT or IRT.
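As a rough illustration of such an N-tier pool (hypothetical item identifiers, not the paper's data), a teacher could maintain it as a simple mapping from difficulty tier to a list of items:

```python
# Hypothetical six-tier item pool for one objective (e.g., Grammar);
# every item within a tier is treated as having the same difficulty.
grammar_pool = {
    1: ["g1-001", "g1-002", "g1-003"],   # easiest tier
    2: ["g2-001", "g2-002"],
    3: ["g3-001", "g3-002", "g3-003"],   # default starting tier
    4: ["g4-001", "g4-002"],
    5: ["g5-001", "g5-002"],
    6: ["g6-001", "g6-002"],             # hardest tier
}
```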
[Figure 1. IRT CAT model. The diagram shows an IRT item pool (items i11, i12, ..., iNx), a pre-test parameter estimation involving more than 200 examinees to obtain the a, b, and c parameters, and an IRT system (IRT algorithm, item selection, stopping rules, parameters) that administers the CAT test.]
To illustrate this idea more clearly, Figure 1 shows a simplified diagram of a typical IRT-based CAT, while Figure 2 shows the proposed SPRT-based CAT. Two major differences can be identified from these diagrams: the IRT-based CAT includes an additional parameter-estimation (a_i, b_i, and c_i) pre-test task for formulas (1)-(3) to work, while the SPRT-based CAT has its item pool divided into N tiers by splitting the various difficulty levels of IRT items into cohesive groups.
[Figure 2. SPRT-based CAT approach. The diagram shows an N-tier SPRT item pool (items i11, i12, ..., iNx grouped by tier) and an SPRT-based system (SPRT algorithm, item selection, stopping rules) that administers the CAT test; no pre-test parameter estimation is involved.]

In accredited certification or licensure centers, it may even be required that the SPRT item pool be rigorously verified against the three-parameter logistic model (3PLM) of IRT, as suggested by Kalohn & Spray (1999). However, in order for this SPRT-based CAT approach to be easily adopted in daily routines, we assume that the tedious parameter-estimation task may be omitted as long as a certain level of validity is retained. Here, that validity rests on the robustness of SPRT under small prior error rates (α and β), as discussed in the SPRT section above, which is probably acceptable to teachers or managers in small-scale scenarios, as opposed to large testing centers. A question therefore remains: how does the outcome of the proposed approach compare to an IRT score? Since each tier of the item pool represents a different ability level, an examinee moves up and down through the tiers of the SPRT item pool from a starting ability level before finally stopping at the first tier judged to be non-mastery or undetermined. The examinee then obtains a semantic label of ability, such as medium or proficient, or an ability level, such as level 1 or level 3. Note that a numeric ability level still requires further interpretation (for example, whether level 1 indicates proficient or novice), whereas the semantic label is already an interpreted ability level in human language.

To see how the SPRT-based CAT approach works, the integrated process for the CAT example is illustrated in Figure 3. The upper half of Figure 3 depicts a single SPRT process cycle, while the lower half represents the N-tier SPRT algorithm that determines how to repeat the single SPRT process cycle. In our CAT scenario, the items cover three objectives, Listening, Grammar, and Reading, each with six difficulty levels, which are set up in the first two actions of the SPRT base. For the most demanding requirement, covering these different perspectives of English ability, the appropriate ability levels of all three objectives need to be identified for an examinee. If an examinee starts with Grammar, the default tier is level 3, and an item is randomly selected from the third-tier item pool. When an item response results in PR ≥ LBM, level 3 is judged to be "mastery" and level 4 is triggered. If PR ≤ UBN, level 3 is judged to be "non-mastery" and level 2 is triggered. Otherwise, UBN < PR < LBM means that level 3 is still undecided, and the next item is randomly selected for the examinee to answer, provided the stopping condition has not yet been met. When the ability level of Grammar is determined, either Listening or Reading is triggered. This scenario does imply a tradeoff in the SPRT-based CAT approach for flexible assessment needs: it takes longer than IRT CAT to determine the examinee's ability because the N-tier structure repeats the SPRT test up to N times.
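The tier-walking logic just described can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation; it reuses the hypothetical sprt_decision function and the tier-to-items pool from the earlier sketches, and is_correct stands in for collecting and scoring an examinee's response.

```python
import random

def run_sprt_at_level(items, is_correct, max_items=20):
    """One SPRT cycle on a single tier: present randomly ordered items
    until a mastery/non-mastery decision (or the item budget) is reached."""
    s = f = 0
    for item in random.sample(items, min(max_items, len(items))):
        if is_correct(item):
            s += 1
        else:
            f += 1
        result = sprt_decision(s, f)        # from the earlier SPRT sketch
        if result != "undetermined":
            return result
    return "undetermined"

def n_tier_sprt(pool, is_correct, start_level=3):
    """Walk the N-tier pool: move up a tier on 'mastery', down otherwise,
    and stop once the examinee's level is bracketed or the scale ends.
    Returns the highest tier judged 'mastery' (0 if none)."""
    level = start_level
    highest_mastered = 0
    lowest_not_mastered = max(pool) + 1
    while 1 <= level <= max(pool) and highest_mastered + 1 < lowest_not_mastered:
        if run_sprt_at_level(pool[level], is_correct) == "mastery":
            highest_mastered = level
            level += 1                       # mastery: try the next harder tier
        else:
            lowest_not_mastered = level
            level -= 1                       # non-mastery/undetermined: step down
    return highest_mastered

# Hypothetical usage: repeat for each objective (Grammar, Listening, Reading)
# grammar_level = n_tier_sprt(grammar_pool, is_correct=my_scoring_function)
```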
[Figure 3 (partial flowchart): single SPRT process cycle. Select either Grammar, Reading, or Listening; select a difficulty level (default 3); randomly select an item, obtain the response, and evaluate it; a Pass decision is made when PR ≥ LBM; otherwise testing continues while the result remains undetermined (UBN < PR < LBM).]