

Advances in Health Sciences Education 3: 29–41, 1998.
© 1998 Kluwer Academic Publishers. Printed in the Netherlands.


Maintaining Content Validity in Computerized Adaptive Testing

RICHARD M. LUECHT, ANDRÉ DE CHAMPLAIN and RONALD J. NUNGESTER
National Board of Medical Examiners, 3750 Market St., Philadelphia, PA, U.S.A. 19104
E-mail: [email protected]

Abstract. A major advantage of using computerized adaptive testing (CAT) is improved measurement efficiency; better score reliability or mastery decisions can result from targeting item selections to the abilities of examinees. However, this type of engineering solution can result in differential content for different examinees at various levels of ability. This paper empirically demonstrates some of the trade-offs which can occur when content balancing is imposed in CAT forms or, conversely, when it is ignored. That is, the content validity of a CAT form can actually change across a score scale when content balancing is ignored. On the other hand, efficiency and score precision can be severely reduced by over-specifying content restrictions in a CAT form. The results from two simulation studies are presented as a means of highlighting some of the trade-offs that could occur between content and statistical considerations in CAT form assembly.

Key words: computerized-adaptive testing, content validity, item response theory

Maintaining Content Validity in Computerized Adaptive Testing

The advent of large-scale computerized adaptive testing (CAT) has had a major impact with respect to the assessment of student abilities in the United States and throughout the world. CAT forms possess several advantages over traditional paper-and-pencil forms:
– they can be tailored to the ability or proficiency level of each examinee, hence considerably reducing testing time and test length;
– they can also provide a more accurate estimate of a person’s proficiency than is usually possible with a paper-and-pencil test (Hambleton, Zaal and Pieters, 1991).

Items that are too easy or too difficult for a given examinee provide very little useful measurement information regarding their proficiency level. By properly targeting the item selections to the abilities of the examinees, we can maximize the obtained measurement information for each test taker and improve the reliability of score estimates or classification decisions. Maximizing measurement information minimizes measurement errors. CAT is therefore said to be statistically efficient either because: (i) the test length can be reduced on average for many examinees with no loss in measurement precision or decision accuracy, or (ii) the precision of scores or decision accuracy can be increased for a fixed or average test length.


Much of the CAT research over the past several decades, including that related to medical licensure (Morrison and Nungester, 1995), has focused almost exclusively on statistical issues such as efficiency and reliability (Hambleton, Zaal and Pieters, 1991; Wainer, 1990), while relatively little attention has been paid to more practical issues, such as content validity (Kingsbury and Zara, 1991; Thomasson, 1995). Clearly, psychometric concerns such as improving score or decision reliability and the efficiency of testing are important. However, those improvements ought not to confound what it is we are attempting to measure. More effort needs to be devoted to fundamental validity concerns.

This paper provides a brief introduction to some of the statistical concepts underlying CAT. Then, the results of two simulation studies are presented to illustrate what can happen when content balancing is and is not imposed. Ultimately, this paper stresses the importance of taking a more comprehensive perspective on efficiency than has been suggested by previous CAT research.

AN OVERVIEW OF CAT TECHNOLOGIES

CAT technology can be applied to a wide range of computer administered assessments, where test difficulty is specifically targeted to match the proficiency of each examinee or where items are selected to provide an accurate pass/fail or mastery decision using as few items as possible. Applications of CAT can be found in achievement or performance assessments, aptitude tests, certification tests, and licensure examinations. Unlike conventional standardized paper-and-pencil tests, where all examinees receive a common test form or test battery comprised of the same items, in CAT, each examinee may receive a different assortment of items selected according to various rules, goals, or procedures. The notion of adaptive distinguishes CAT from more generic computer-based tests (CBT). A CAT is fundamentally a CBT with extensive algorithms added to the computer delivery software to govern the selection of items for each examinee. These algorithms essentially build a customized test for each examinee.

Implementing CAT depends on: (1) having a calibrated item pool; and (2) using a particular adaptive test construction process. CAT item pools must be calibrated in order to statistically estimate the characteristics of the items for the intended population of examinees. Item statistics stored in the pool are usually obtained from previous administrations of the items (e.g. via extended pretesting). Without calibrated item statistics, CAT technology, i.e. the item selection algorithms, cannot be effectively used. The calibrated item statistics describe the difficulty and other statistical characteristics of each item for the examinee population (e.g. how well the item discriminates among examinees of varied abilities). Powerful psychometric models and inferential procedures developed under item response theory (IRT) are employed to obtain the item statistics and to simultaneously place the statistics on a fixed measurement scale or metric.


Figure 1. Items and examinees’ scores calibrated to a common scale.

This calibration process is directly analogous to calibrating laboratory or engineering equipment using an external metric or baseline measures. IRT allows the item statistics and the scores of examinees to be represented on the same score scale. This is illustrated in Figure 1 and has important implications for targeting items in a CAT to a pass/fail point on the score scale or to the estimated score of a particular examinee. As Figure 1 demonstrates, item 3 is the easiest item whereas item 7 is the most difficult item. On the same scale, examinee A has the lowest score while examinee C has the highest score. Finally, using a common scale, we see that items 6, 5, and 9 are closely targeted, respectively, to the scores for examinees A, B, and C.

The calibrated item statistics, along with content and other qualitative properties of each item, serve as the primary, fixed inputs to the item selection engine used in most CAT software. The CAT item selection engine determines which items are administered to each examinee. It is important to realize that when an examinee sits down to take a CAT, very little is known about the candidate’s level of proficiency. Starting with virtually no information, a CAT administers a single item from the item pool (or, in some cases, several items). The selection of the first item is essentially an educated guess or may simply be a random selection from the pool. From then on, each response the examinee gives to each selected item increases our certainty (decreases the likelihood of errors) about the examinee’s score.

Figure 2 shows the degree of certainty about an examinee’s score after 3 items and again after 100 items. For the sake of this example, assume that we know that this examinee’s true proficiency score is approximately 220. After only 3 items, our certainty is represented by the flat, wide curve, indicating a lack of confidence about the precision of the provisional estimate near the peak of that curve (at a score of about 205). However, after administering 100 items, we find that: (i) the provisional score estimate is quite close to the true proficiency; and (ii) our certainty is very high, as indicated by the tall, narrow curve.


Figure 2. Certainty about provisional score estimates after 3 & 100 items.
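The narrowing illustrated in Figure 2 can be reproduced numerically. The minimal Python sketch below is an illustration only, not code from the study: the pool, the examinee, and all parameter values are invented. It evaluates the likelihood of a simulated examinee's responses over a grid of candidate scores after 3 items and after 100 items, using a two-parameter logistic model of the kind defined later in the Methodology section; the 3-item curve is flat and wide, while the 100-item curve is tall and narrow.

```python
import numpy as np

rng = np.random.default_rng(7)

def p_correct(theta, a, b, scale=15.0):
    """Two-parameter logistic probability of a correct response.

    theta and b are on the reporting scale used later in the paper
    (mean 50, SD 15), so the logit is divided by that scale's SD.
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b) / scale))

# Hypothetical calibrated pool: discriminations near 1, difficulties spread over the scale.
n_items = 100
a = rng.uniform(0.6, 1.6, n_items)
b = rng.normal(50.0, 15.0, n_items)

true_theta = 58.0                                   # the examinee's (unknown) proficiency
responses = rng.random(n_items) < p_correct(true_theta, a, b)

def likelihood(grid, a, b, u):
    """Likelihood of the observed responses u at each candidate score in grid."""
    p = p_correct(grid[:, None], a[None, :], b[None, :])
    return np.prod(np.where(u[None, :], p, 1.0 - p), axis=1)

grid = np.linspace(10.0, 90.0, 161)
for n in (3, 100):
    like = likelihood(grid, a[:n], b[:n], responses[:n])
    like /= like.max()                              # rescale so the peak equals 1
    peak = grid[np.argmax(like)]
    width = np.ptp(grid[like > 0.5])                # rough width of the high-certainty region
    print(f"after {n:3d} items: provisional score near {peak:.1f}, certainty-band width {width:.1f}")
```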

This is essentially how most CAT algorithms function: estimating a provisional score after administering each item until our level of certainty is high enough to trust that we are near the examinee’s unknown true proficiency (or until we are certain enough to make a clear classification decision). In the language of IRT, we refer to the incremental precision that items contribute to scores as information. Maximizing information has been largely viewed as the sole criterion for CAT. That is, if we can maximize information, we can give shorter tests with no loss in measurement precision, or increase our score precision while holding test length constant.

Unfortunately, that singular maximum information criterion ignores other critical pragmatic factors in testing, chief among them content balancing issues. In many past CAT studies, test content has been routinely relegated to a secondary level of importance. Although researchers have discussed content balancing, little has been done to seriously investigate the role and impact of content in adaptive testing settings (Thomasson, 1995).

CONTENT ISSUES IN CAT

Most established testing programs dedicate a great deal of attention to content. In fact, the validity of a test depends primarily on demonstrating that the test content is relevant to the use of the test scores or decisions (Kane, 1982). For this reason, test designers may go to great lengths to develop detailed content “blueprints” that specifically map out content area allocations for a given test form. The extent to which the items in a test meet the blueprint specifications can be easily assessed within a paper-and-pencil framework given the limited number of forms administered at any given time. However, this process is considerably more complex within a CAT environment where a large number of forms are tailored towards various examinees as a function of their proficiency level.


In a complex testing environment such as medical licensure, the content detail is extensive and experts require assurances that every examinee is thoroughly tested on all aspects of the field. It is also possible that the difficulty of items in a particular pool may vary systematically for different content areas. For example, for a particular cohort of medical students, items on basic principles of behavioral sciences may be easier, on average, than biochemistry items. Within a CAT context, there are serious implications for the validity of scores or classification decisions if these types of relationships between content and item characteristics are ignored.

The purpose of this investigation was to illustrate, via two simulation studies, the nature and extent of some of the problems related to content and statistics in a CAT context. A variety of trade-offs may occur between managing content requirements and maximizing measurement or statistical efficiencies. While we recognize the critical importance of content validity in CAT, we must also not lose sight of the need for accurate scores.

Methodology

DATA GENERATION

Dichotomous item responses were generated using a two-parameter IRT model (Birnbaum, 1968; Lord, 1980) given by

\mathrm{Prob}\{u_{ij} = \text{correct} \mid \theta_j; a_i, b_i\} \equiv P_{ij} = \frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]}    (1)

where: u_ij is the dichotomous response of examinee j to item i; θ_j denotes the ability score or unobserved trait of interest for a random examinee j; a_i is an item discrimination parameter representing the sensitivity of item i to the underlying trait, θ; and b_i is an item difficulty parameter, or the location of the item on the θ scale. Using this model, the probability of a correct response to item i by examinee j (denoted by u_ij = 1) is a function of the examinee’s ability, θ_j, and the item parameters, a_i and b_i.

Examinee ability scores (θ_j) were randomly generated from a unit normal distribution and subsequently linearly rescaled to a mean of 50 and a standard deviation of 15 to facilitate the interpretability of results. The item discrimination (a_i) and item difficulty (b_i) parameters were set to predetermined values, usually within various content categories, to account for the location and amount of measurement information in the item pools.


For a particular item pool and sampled ability, θ_j, P_ij from Equation 1 was computed, along with a random uniform probability. If the uniform value did not exceed P_ij, a score of u_ij = correct was assigned; otherwise, u_ij = incorrect was assigned. The generated scores were then used in the corresponding CAT simulation, with the generating item parameters treated as known values. The same examinee response vectors were used for each simulated type of adaptive test (content-balanced or not). Therefore, examinee sampling effects were minimized across the item selection conditions of each study.

CAT ITEM SELECTIONS

The initial item for each examinee’s CAT was randomly selected. The response to that and subsequent items, u_ij, was used to compute a provisional estimate of θ_j (denoted θ̂_j). The next item was then chosen from the available (i.e. unselected) pool, such that

I(\hat{\theta}_j; a_i, b_i) = a_i^2\, P_{ij}(1 - P_{ij})    (2)

was maximized, where I(θ̂_j; a_i, b_i) is known as the item information function (Birnbaum, 1968), with P_ij defined by Equation 1. These item information functions are additive across items and, in sum, are inversely related to the estimation error variance of the θ̂_j values about the true value, θ_j; i.e.

\sum_{i=1}^{n} I(\hat{\theta}_j; a_i, b_i) = \frac{1}{\mathrm{Var}_e(\hat{\theta}_j \mid \theta_j)}    (3)

Therefore, by selecting items which maximize Equation 2, we simultaneously minimize the error variance of the estimated scores. For the content-balanced CATs, the same information maximization procedures were used, except that the item selections were further constrained to match a predetermined quantity within particular content categories.
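As a concrete, minimal sketch of Equations 1–3 and of the unbalanced versus content-balanced selection rules just described (this is a reconstruction, not the authors' code: the provisional score here is a coarse grid-search maximum likelihood estimate, since the paper does not specify its estimator, and the item parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(42)
SCALE = 15.0  # score scale used in both studies: mean 50, SD 15

def p2pl(theta, a, b):
    """Equation 1: two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b) / SCALE))

def item_info(theta, a, b):
    """Equation 2: item information, a^2 * P * (1 - P)."""
    p = p2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def provisional_score(a, b, u, grid=np.linspace(5.0, 95.0, 181)):
    """Crude grid-search maximum-likelihood estimate of theta (Equation 3 is why
    adding high-information items shrinks the error variance of this estimate)."""
    p = p2pl(grid[:, None], a[None, :], b[None, :])
    loglik = np.sum(np.where(u[None, :], np.log(p), np.log(1.0 - p)), axis=1)
    return grid[np.argmax(loglik)]

def run_cat(theta_true, a, b, content, test_len=100, quota=None):
    """Administer one fixed-length CAT; quota maps content area -> required item count."""
    remaining = dict(quota) if quota else None
    available = np.ones(len(a), dtype=bool)
    administered, responses = [], []
    theta_hat = 50.0
    for t in range(test_len):
        candidates = available.copy()
        if remaining is not None:                      # content balancing: skip filled areas
            candidates &= np.array([remaining[c] > 0 for c in content])
        idx = np.flatnonzero(candidates)
        if t == 0:
            best = int(rng.choice(idx))                # the first item is a random selection
        else:
            best = int(idx[np.argmax(item_info(theta_hat, a[idx], b[idx]))])
        u = bool(rng.random() < p2pl(theta_true, a[best], b[best]))   # simulated response
        available[best] = False
        if remaining is not None:
            remaining[content[best]] -= 1
        administered.append(best)
        responses.append(u)
        theta_hat = provisional_score(a[administered], b[administered], np.array(responses))
    return theta_hat, administered

# Hypothetical 1000-item pool: area 'A' targeted above the mean, area 'B' below it.
a_par = rng.uniform(0.5, 1.5, 1000)
b_par = np.concatenate([rng.normal(58.0, 10.0, 500), rng.normal(42.0, 10.0, 500)])
content = np.array(["A"] * 500 + ["B"] * 500)

theta_true = 35.0                                      # a low-scoring simulated examinee
est_u, items_u = run_cat(theta_true, a_par, b_par, content)
est_b, items_b = run_cat(theta_true, a_par, b_par, content, quota={"A": 50, "B": 50})
print(f"unbalanced: estimate {est_u:.1f}, items from area A: {sum(content[i] == 'A' for i in items_u)}")
print(f"balanced:   estimate {est_b:.1f}, items from area A: {sum(content[i] == 'A' for i in items_b)}")
```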

Study 1: Some Simple Effects of Ignoring Content Balancing in CAT Form Assembly

In the first simulation study, an item pool of 1000 items was used. The item pool was equally divided into two content areas, A and B.

INFORMATION FUNCTIONS

Figure 3 shows the cumulative information function curves for each set of items. These information curves were computed as the sum of the individual item information functions (cf. Equation 2) for the 500 items in content area A and for the 500 items in content area B, at various points along the ability score scale, θ (rescaled to a mean of 50 and a standard deviation of 15).
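Such cumulative curves (Figure 3 below, and later Figure 5) are simply the item information of Equation 2 summed over the items in a content area and evaluated along a grid of score points. A minimal sketch of that computation, with invented item parameters standing in for the actual pool:

```python
import numpy as np

rng = np.random.default_rng(1)
SCALE = 15.0

def item_info(theta, a, b):
    """Equation 2 for arrays of item parameters at a single score point theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b) / SCALE))
    return a ** 2 * p * (1.0 - p)

# Hypothetical pool: content area A targeted above the mean of 50, area B below it.
pool = {
    "A": (rng.uniform(0.6, 1.6, 500), rng.normal(58.0, 10.0, 500)),
    "B": (rng.uniform(0.5, 1.4, 500), rng.normal(42.0, 10.0, 500)),
}

grid = np.linspace(20.0, 80.0, 13)
for area, (a, b) in pool.items():
    cumulative = np.array([item_info(t, a, b).sum() for t in grid])
    peak = grid[int(np.argmax(cumulative))]
    print(f"content area {area}: cumulative information peaks near a score of {peak:.0f}")
```

Plotting the cumulative values against the score grid for each area would reproduce the general shape of Figure 3.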


Figure 3. Information function curves for simulation #1 items (500 items in A and B).

It appears as though the items in content area A are better targeted towards higher scoring examinees whereas content area B items are more accurately matched to lower scoring examinees. Furthermore, the items in content area A are nominally more informative (have higher information in sum) than content area B items. This type of pool could reflect the differences between biochemistry and behavioral science items, as alluded to earlier.

Using this pool of 1000 items, 2000 simulated examinee responses were generated as noted above. Each examinee was administered two 100-item fixed length CATs. The first CAT (unbalanced) was administered by simply choosing the most informative items for each examinee. The second CAT (balanced) imposed content restrictions so that each examinee received exactly 50 items from content area A and 50 items from content area B, while still attempting to maximize the item information. The same responses were available for each examinee under each CAT condition.

Results

SUMMARY STATISTICS

The estimated abilities for the 2000 examinees under the 100-item unbalanced and balanced adaptive tests are summarized in Table I. In comparing the score distributions and accuracy of the scores, virtually no differences are apparent between the two adaptive testing conditions. One might say that the examinees ought to be indifferent as to which type of test was given since it appears to have made little difference in their scores.

However, the impact the administered test might have with respect to content validity issues is important. Figure 4 shows the average number of items in content areas A and B administered to examinees at different regions of the score scale for the unbalanced CAT condition.


Table I. Summary of Score Estimates for 2000 Simulated Examinees in Simulation Study I

Summary                                       Unbalanced    Balanced
Number of items administered per examinee     100           100
Mean estimated score                          50.10         50.05
Standard deviation of scores                  13.70         13.71
Average std. error of estimate                1.92          1.94
Reliability of score estimates                0.98          0.98

Figure 4. Outcomes of not content balancing (UNBALANCED).

(Recall that for the balanced condition, each examinee saw exactly 50 items in both content areas A and B.) Examinees whose score was less than 38 saw predominantly items from content area B. In fact, as many as 80% of the items were selected solely from content area B for those examinees whose scores were located at the lowest regions of the score scale. Conversely, for high scoring examinees, nearly 80% of the selected items were from content area A while only 20% of the items were chosen from content area B. Clearly, examinees did not receive the same blend of content across the score scale. Whether or not these average percentages of items seen within each content area are sufficiently divergent to warrant validity concerns would depend largely on the purpose of the test and any claims made by the test developers about content coverage. Nonetheless, ignoring content balancing can have apparent and dramatic consequences.
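A summary of the kind plotted in Figure 4 can be produced by binning examinees into score intervals and averaging, within each interval, the proportion of administered items drawn from each content area. The sketch below uses invented toy data; the study's actual tabulation is not reproduced here.

```python
import numpy as np

def content_mix_by_score(scores, administered_areas, edges):
    """Average proportion of 'A' items administered, within each score interval."""
    prop_a = np.array([np.mean(areas == "A") for areas in administered_areas])
    bins = np.digitize(scores, edges)
    return {f"{edges[k - 1]:.0f}-{edges[k]:.0f}": float(prop_a[bins == k].mean())
            for k in range(1, len(edges)) if np.any(bins == k)}

# Toy data: low scorers mostly saw area 'B' items, high scorers mostly area 'A' items.
rng = np.random.default_rng(3)
scores = rng.normal(50.0, 14.0, 2000)
share_a = np.clip((scores - 20.0) / 60.0, 0.05, 0.95)
administered_areas = [rng.choice(["A", "B"], size=100, p=[p, 1.0 - p]) for p in share_a]

for interval, share in content_mix_by_score(scores, administered_areas, np.linspace(20, 80, 7)).items():
    print(f"scores {interval}: {share:.0%} of administered items were from content area A")
```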


Study 2: Some Complex Effects of Ignoring Content Balancing in CAT Form Assembly

A total of 1500 items were utilized in the second simulation study. These items were equally divided into five content areas (i.e., 300 items in each content area) labeled I, II, III, IV, and V. Two 100-item CAT forms were administered to 1000 simulated examinees: (1) a standard CAT ignoring content balancing (unbalanced); and (2) a content-balanced CAT requiring that exactly 20 items be selected from each of the five content areas (balanced).

INFORMATION FUNCTIONS

Figure 5. Information functions for study #2: Content areas I, II, III, IV & V.

Figure 5 shows the cumulative information functions for the 300 items in each of the five content areas. As shown in that figure, content set I and II items have similar amounts and locations of item information, as indicated by the peaks of the curves. Set III items have a similar amount of information to set I and II items, but they are targeted toward the lower region of the score scale. Set IV items are similar in average item difficulty to set III items, but contribute proportionally less information than set III items. Set V items are more similar in difficulty to set I and II items but are only as informative as set IV items, on average.


Table II. Summary of Score Estimates for 1000 Simulated Examinees in Simulation II

Summary                                       Unbalanced    Balanced
Number of items administered per examinee     100           100
Mean estimated score                          50.08         50.11
Standard deviation of scores                  14.03         13.80
Average std. error of estimate                2.22          2.35
Reliability of score estimates                0.97          0.97

Results

SUMMARY SCORE STATISTICS

The characteristics of the estimated score distributions for the unconstrained adaptive simulations (unbalanced) and the content-constrained adaptive tests (balanced) are summarized in Table II. As was the case in the first simulation study, there were no apparent differences in these outcomes at the total test level. The means and standard deviations indicate that the scores were similarly distributed, while the standard errors and reliability coefficients indicate approximately equivalent accuracy in the estimated scores.

SCORE PRECISION

In order to better understand the impact of using content constraints with respect to accuracy, scores were estimated for the balanced content CAT using only 40 items from two of the five content areas. Set I items were included in each content pairing as a common baseline for judging the magnitude of the estimation errors. The differences in accuracy among these content-based score estimates enabled us to assess the amount of information or precision each content area contributed to the scores, over and above that contributed by the set I items.

The bar chart in Figure 6 shows the average empirical standard errors of the estimated scores, that is, the absolute difference between each estimated score and the examinee’s true score (i.e. the random normal score used to generate the simulated response data). The bars are repeated for each of the four content pairings and are summarized for nine equal-sized intervals along the score scale. Larger values (higher bars) are indicative of a greater degree of error.

The standard errors for the score estimates based only on the set I and II items tend to be small on average, especially in the central region of the scale where most scores are located. Referring back to Figure 5, it is clear that the information functions for those two content areas (I and II) were virtually coincident and both covered a fairly large range of the score scale. Similarly, the scores based on set I and III items have small estimation errors. However, they appear to be more accurate than the set I and II estimates in the lower region of the score scale, where set III items were contributing most of their information.
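The empirical standard errors shown in Figure 6 (caption below) are average absolute differences between the 40-item paired-content estimates and the generating (true) scores, summarized within nine equal-width intervals of the score scale. A sketch of that summary, using invented stand-in data rather than the study's actual estimates:

```python
import numpy as np

def empirical_se_by_interval(theta_true, theta_hat, n_intervals=9, lo=20.0, hi=80.0):
    """Mean absolute estimation error within equal-width intervals of the true-score scale."""
    edges = np.linspace(lo, hi, n_intervals + 1)
    errors = np.abs(np.asarray(theta_hat) - np.asarray(theta_true))
    bins = np.digitize(theta_true, edges)
    return [(f"{edges[k - 1]:.0f}-{edges[k]:.0f}",
             float(errors[bins == k].mean()) if np.any(bins == k) else None)
            for k in range(1, n_intervals + 1)]

# Toy data: 40-item estimates are noisier where the paired content areas give little information.
rng = np.random.default_rng(11)
theta_true = rng.normal(50.0, 14.0, 1000)
noise_sd = 2.0 + 0.08 * np.abs(theta_true - 50.0)          # more error in the scale's tails
theta_hat = theta_true + rng.normal(0.0, noise_sd)

for interval, se in empirical_se_by_interval(theta_true, theta_hat):
    print(f"interval {interval}: " + (f"empirical SE = {se:.2f}" if se is not None else "no examinees"))
```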


Figure 6. Empirical std. errors for estimated scores based on 40 paired-content items.

In contrast, the errors are considerably larger throughout the scale for scores calculated from the content set I & IV and I & V pairings. The explicit constraints which forced the balanced CAT algorithm to select 20 items each from sets IV and V led to a noticeable reduction in score precision, at least based on these 40 items.

However, the validity of the test from a content perspective also needs to be examined when we do not explicitly employ balancing. The unbalanced tests did exhibit a rather apparent and fairly serious content validity problem. Table III shows the marginal percentages of items selected within each content area for the unbalanced CAT. The score interval midpoint at which items in each content set were most likely to be selected (i.e. the modal interval midpoint value) is also shown in Table III.

Items were selected from sets I and II in nearly equivalent proportions, based solely on the CAT information maximization algorithm, and were most likely to be selected for examinees in the neighborhood of the mean (i.e. the score interval directly above the mean). Set III items were selected less often, on average, and primarily for examinees below the mean. Neither of these results is particularly surprising given the information functions shown in Figure 5. However, content area IV and V items were selected less than one percent of the time, and then only for very high scoring examinees.


Table III. Marginal Percentages of Items Selected by Content Area and Modal Interval of Selection

Item content areas    Percentage of item-by-examinee transactions    Modal score value for item selections
Set I                 38
Set II                37
Set III               25
Set IV                <1                                             80.0
Set V                 <1                                             80.0

Entirely excluding these two content areas could represent a serious violation of the content validity of the unbalanced tests. Furthermore, from the viewpoint of efficient use of resources, these results indicate that at least 40 percent of the item pool (the 600 items in sets IV and V out of the 1500 items in the pool) was essentially never administered. Considering the costs associated with the production of 600 items, one would be hard pressed to justify not using them. Unfortunately, a strict statistical definition of efficiency tends to ignore those types of tangible costs.

Discussion

The simulation studies presented in this paper were neither meant to be generalizable to any real examinations nor comprehensive in scope. Rather, they served to emphasize the trade-offs which can occur when attempting to meet content validity requirements while maximizing score precision or decision accuracy in a CAT environment. It is fairly well established that an unreliable test cannot be valid (formally, the square root of the reliability coefficient places an upper bound on a validity coefficient). At the same time, a highly reliable test is not necessarily valid. CAT therefore helps establish a necessary condition for validity (i.e. high reliability), but this is not a sufficient condition to ensure validity.

Implicit in this paper is the recommendation that evaluating the trade-offs between content balancing and statistical criteria in CAT form assembly should start with a thorough analysis of the item pool. Analyzing the amount of item information in an item pool separately by major content categories is an excellent way to begin to comprehend the particular trade-offs that may occur. Obviously, adding too many detailed content constraints in the item selection process will tend to increase the requisite size of the item pools. There must be enough “degrees of freedom” to ensure a successful balancing of content and maximum information. Specifying a larger number of constraints than there are items to satisfy them will lead to reduced score precision. Arbitrary constraints are costly and should probably be avoided.

These simulation studies also provided some preliminary insights into the interactions that exist between content, statistical criteria, and CAT item selection mechanisms.


First, when items within specific content areas have only small average amounts of information to contribute, adding constraints in those areas will usually reduce score precision. In such cases, it seems desirable to expand the item pool by focused item writing in those content areas to specifically increase the amount of information available at various points along the score scale. This becomes a nontrivial task since it requires that items be written to particular levels of difficulty as well as to meet content requirements. Second, when items in certain content areas vary by the average location of their maximum information, but not by amount, content balancing will probably cause, at worst, a moderate reduction in score precision. Nevertheless, in such cases, the information functions should be monitored over time to ensure that the different content areas remain relatively stable in terms of average difficulty and the amount of information available at points along the score scale. Finally, the ideal situation occurs when the location and amount of information, on average, is equivalent across content areas. Content balancing then comes at a nominal cost from a statistical perspective and reliability can also be maximized.

Obviously, there are numerous other factors to consider when implementing a testing program under a CAT paradigm. This paper merely stresses the need to consider both content validity and reliability issues as critical test development factors and to avoid the temptation to view CAT as a panacea for testing strictly on the grounds of improved efficiency.

References

Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability. In Lord, F.M. & Novick, M.R. (eds.), Statistical Theories of Mental Test Scores, 397–422. Addison-Wesley: Reading.
Hambleton, R.K., Zaal, J.N. & Pieters, J.P.M. (1991). Computerized Adaptive Testing: Theory, Applications and Standards. In Hambleton, R.K. & Zaal, J.N. (eds.), Advances in Educational and Psychological Testing, 341–366. Kluwer Academic Publishers: Boston.
Kane, M.T. (1982). A Sampling Model for Validity. Applied Psychological Measurement 6, 125–160.
Kingsbury, G.G. & Zara, A.R. (1991). A Comparison of Procedures for Content-Sensitive Item Selection in Computerized Adaptive Tests. Applied Measurement in Education 4, 241–261.
Lord, F.M. (1980). Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates: Hillsdale.
Morrison, C.A. & Nungester, R.J. (1995, April). Computerized Adaptive Testing in a Medical Licensure Setting: A Comparison of Outcomes Under the One- and Two-Parameter Logistic Models. Paper presented at the meeting of the National Council on Measurement in Education, San Francisco, CA.
Thomasson, G.L. (1995, June). New Item Exposure Control Algorithms for Computerized Adaptive Testing. Paper presented at the meeting of the Psychometric Society, Minneapolis, MN.
Wainer, H. (1990). Computerized Adaptive Testing: A Primer. Lawrence Erlbaum Associates: Hillsdale.
