Journal of Educational Measurement Fall 2003, Vol. 40, No. 3, pp. 231-253
Use of the Rasch IRT Model in Standard Setting: An Item-Mapping Method

Ning Wang
Widener University

This article provides both logical and empirical evidence to justify the use of an item-mapping method for establishing passing scores for multiple-choice licensure and certification examinations. After describing the item-mapping standard-setting process, the rationale and theoretical basis for this method are discussed, and the similarities and differences between the item-mapping and the Bookmark methods are also provided. Empirical evidence supporting use of the item-mapping method is provided by comparing results from four standard-setting studies for diverse licensure and certification examinations. The four cut score studies were conducted using both the item-mapping and the Angoff methods. Rating data from the four standard-setting studies, using each of the two methods, were analyzed using item-by-rater random effects generalizability and dependability studies to examine which method yielded higher inter-judge consistency. Results indicated that the item-mapping method produced higher inter-judge consistency and achieved greater rater agreement than the Angoff method.
Licensure and certification examinations are developed to determine whether candidates have adequate levels of knowledge and skills to perform competently in their professions. One of the most difficult steps in creating these examinations is determining a score that defines a passing level of competency. The pass/fail decision is high stakes, both for candidates and for the public that is served by candidates who pass. It is vital that the passing score reflect an appropriate standard of performance, a standard that differentiates between candidates who are sufficiently competent and those who have not obtained the level of competence to be licensed. Defining and translating such a standard into a fair and legally defensible cut score requires the systematic application of a well-reasoned, reputable standard-setting method (Berk, 1996; Mills, 1995; Norcini & Shea, 1997). Among the standard-setting methods, the Angoff (1971) procedure with various modifications is the most widely used for multiple-choice licensure and certification examinations (Plake, 1998). As part of the Angoff standard-setting process, judges are asked to estimate the proportion (or percentage) of minimally competent candidates (MCC) who will answer an item correctly. These item performance estimates are aggregated across items and averaged across judges to yield the recommended cut score. As noted (Chang, 1999; Impara & Plake, 1997; Kane, 1994), the adequacy of a judgmental standard-setting method depends on whether the judges adequately conceptualize the minimal competency of candidates, and whether judges accurately estimate item difficulty based on their conceptualized minimal competency. A major criticism of the Angoff method is that judges' estimates of item difficulties for minimal competency are more likely to be inaccurate, and
sometimes inconsistent and contradictory (Bejar, 1983; Goodwin, 1999; Mills & Melican, 1988; National Academy of Education [NAE], 1993; Reid, 1991; Shepard, 1995). Studies found that judges are able to rank order items accurately in terms of item difficulty, but they are not particularly accurate in estimating item performance for target examinee groups (Impara & Plake, 1998; National Research Council, 1999; Shepard, 1995). A fundamental flaw of the Angoff method is that it requires judges to perform the nearly impossible cognitive task of estimating the probability of MCCs answering each item in the pool correctly (Berk, 1996; NAE).

An item-mapping method, which applies the Rasch IRT model to the standard-setting process, has been used to remedy the cognitive deficiency in the Angoff method for multiple-choice licensure and certification examinations (McKinley, Newman, & Wiser, 1996). The Angoff method limits judges to each individual item while they make an immediate judgment of item performance for MCCs. In contrast, the item-mapping method presents a global picture of all items and their estimated difficulties in the form of a histogram chart (item map), which serves to guide and simplify the judges' process of decision making during the cut score study. The item difficulties are estimated through application of the Rasch IRT model. Like all IRT scaling methods, the Rasch estimation procedures can place item difficulty and candidate ability on the same scale. An additional advantage of the Rasch measurement scale is that the difference between a candidate's ability and an item's difficulty determines the probability of a correct response (Grosse & Wright, 1986). When candidate ability equals item difficulty, the probability of a correct answer to the item is .50. Unlike the Angoff method, which requires judges to estimate the probability of an MCC's success on an item, the item-mapping method provides the probability (i.e., .50) and asks judges to determine whether an MCC has this probability of answering an item correctly. By utilizing the Rasch model's distinct relationship between candidate ability and item difficulty, the item-mapping method enables judges to determine the passing score at the point where the item difficulty equals the MCC's ability level.

The purpose of the present article is to provide both logical and empirical evidence to justify the use of the item-mapping method for establishing cut scores for licensure and certification examinations. After describing the standard-setting process, the rationale for using the item-mapping method is discussed. The theoretical basis behind this method (i.e., the Rasch IRT model) and justification for the use of the probability value in setting a cut score are also presented. Empirical evidence is provided by comparing results of four standard-setting studies that used both the item-mapping and the Angoff methods to set cut scores for diverse licensure and certification examinations. In addition to reporting and discussing the cut scores of the standard-setting studies, rating data from the standard-setting studies were analyzed using item-by-rater random effects generalizability and dependability studies to address the following research question: Does the item-mapping or the Angoff method yield more consistent item performance estimates by judges (i.e., which method yields higher inter-judge consistency)?
This article also recognizes the need for further research to provide additional evidence of validity to support the use of the item-mapping standard-setting method and to validate the use of the .50 probability value.
The Item-Mapping Standard-Setting Process

The item-mapping method incorporates item performance in the standard-setting process by graphically presenting item difficulties. In item mapping, all the items for a given examination are ordered in columns, with each column in the graph representing a different item difficulty. The columns of items are ordered from easy to hard on a histogram-type graph, with very easy items toward the left end of the graph and very hard items toward the right end of the graph. Item difficulties in log odds units are estimated through application of the Rasch IRT model (Wright & Stone, 1979). In order to present items on a metric familiar to the judges, logit difficulties are converted to scaled values using the following formula: scaled difficulty = (logit difficulty × 10) + 100. This scale usually ranges from 70 to 130. Figure 1 provides an example of an item map. In the example, the abscissa of the graph represents the rescaled item difficulty. Any one column has items within two points of each other. For example, the column labeled "80" has items with scaled difficulties ranging from 79 to values less than 81. Using the scaling equation, this column of items would have a range of logit difficulties from -2.1 to values less than -1.9, yielding a 0.2 logit difficulty range for items in this column. Similarly, the next column to its right has items with scaled difficulties ranging from 81 to values less than 83 and a range of logit difficulties from -1.9 to values less than -1.7. In fact, there is a 2-point range (1 point below the labeled value and 1 point above the labeled value) for all the columns on the item map. Within each column, items are displayed in order by item ID numbers and can be identified by color- and symbol-coded test content areas. By marking the content areas of the items on the map, a representative sample of items within each content area can be rated in the standard-setting process.

The goal of item mapping is to locate a column of items on the histogram where judges can reach consensus that the MCC has a .50 chance of answering the items correctly. The process works as follows. After an extensive discussion of the characteristics of MCCs, the item map histogram is presented to the standard-setting committee (judges). An easy item is first selected from the graph by the standard-setting facilitator. The judges are asked to review the item text independently, and to provide a yes or no answer to the question, "Does a typical MCC have at least a .50 chance of answering the item correctly?" Since the selected item is very easy, most of the judges would agree that the MCC is able to answer the item correctly, but at a greater chance than .50. The facilitator then selects a more difficult item and repeats the process until a column of items is located where most items in the column have consensus from the judges that the probability of a correct response for the MCC is .50. The level of difficulty under this column of items represents the cut score. Since there is a 0.2 logit range of item difficulties in any given column, normally the middle value is chosen as the cut point. Once a column of items is identified where the MCC has a .50 chance of answering the items correctly, the standard-setting process produces a complete picture of the MCC's ability in terms of the probabilities of correctly answering any item on the item map.
This is accomplished by using a probability ruler (see Figure 2), which is always constructed on the same scale as the item map.
FIGURE 2. An example of a probability ruler.

With the probability ruler aligned so that the .50 marker is centered over the column selected to be the cut, the probability that the MCC will correctly answer a particular item can be identified. For example, if a column of items with a scaled difficulty of 100 is selected as an initial cut score, the facilitator will slide the probability ruler so that the .50 marker is aligned with the column of 100 (see Figure 3). The ruler will then show that the MCC has an 87% chance of success on the column of items above 81, a 13% chance of success on the column of items above 119, and so forth. It should be noted that this statement is always made relative to the column of items where the .50 marker is placed. Given this picture of the MCC's ability, judges can determine whether the initial cut meets their conceptual understanding of minimal competency, and whether it is necessary to iterate the rating process to adjust the initial cut score.

FIGURE 3. Probability ruler aligned with the scaled difficulty of 100.

Near the end of the standard setting, a histogram representing the distribution of candidate performance from a reference group of test takers is presented to predict the passing percentage, given the established cut score. A line is drawn at the ability level represented by the proposed cut score, yielding an immediate pass rate projection based on the cut score. This gives judges feedback on the reasonableness of the proposed cut score, given the distribution of candidate abilities. Figure 4 provides an example of an ability distribution histogram. In this example, if a column of items with a scaled difficulty of 100 is selected as the cut score, the predicted pass rate is approximately 77.6%.

FIGURE 4. An example of a candidate ability distribution (abscissa: scaled ability).
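To make the probability ruler arithmetic concrete, the following sketch reproduces the calculations described above under the Rasch model, using only the scaling rule given earlier (scaled difficulty = logit difficulty × 10 + 100). The column values and the chosen cut are hypothetical illustrations, not data from the studies reported in this article.

```python
import math

def scaled_from_logit(logit_difficulty):
    """Convert a Rasch logit difficulty to the scaled metric used on the item map."""
    return logit_difficulty * 10 + 100

def prob_correct(scaled_ability, scaled_difficulty):
    """Rasch probability that a candidate at scaled_ability answers an item of
    scaled_difficulty correctly (both values on the 10-points-per-logit scale)."""
    logit_gap = (scaled_ability - scaled_difficulty) / 10.0
    return 1.0 / (1.0 + math.exp(-logit_gap))

# Suppose the judges settle on the column labeled 100 as the cut:
# the MCC's ability is then taken to equal that scaled difficulty.
mcc_ability = 100

# Probability-ruler readout for a few hypothetical columns on the map.
for column in (81, 90, 100, 110, 119):
    p = prob_correct(mcc_ability, column)
    print(f"column {column:>3}: MCC chance of success = {p:.2f}")
# column 81 -> 0.87, column 100 -> 0.50, column 119 -> 0.13,
# matching the 87%, 50%, and 13% readings quoted in the text.
```

Sliding the ruler corresponds to changing mcc_ability; the probabilities for every column are then read off relative to the new cut.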
Rationale

Researchers offered the following two reasons for judges being unable to accurately estimate item performance for MCCs in test-centered standard settings (Impara & Plake, 1997): (a) judges may have difficulty conceptualizing MCCs; and (b) judges may have difficulty estimating percentage of success, even for clearly defined MCCs. The item-mapping method provides useful strategies, such as ease in conceptualizing an MCC, reduced cognitive complexity in determining the chance of an MCC's success, and visualized presentation of item difficulties in a global map, aimed at minimizing these problems.

Conceptualizing minimal competency and translating the concept into an operational definition of MCCs is typically required in setting cut scores for licensure and certification examinations. An extensive discussion is usually involved to help judges develop a common understanding of minimal competency (Cizek, 1996; Mills, Melican, & Ahluwalia, 1991). Even so, conceptualizing a group of hypothetical candidates is still considered to be a very difficult task for judges in test-centered standard settings (Impara & Plake, 1997). The item-mapping method helps judges with the conceptualization of the MCC by asking them to consider a real and typical MCC in a training program or class. Then, the question for judges in item mapping is whether this real candidate has at least a .50 chance of answering an item correctly (i.e., a yes or no answer), rather than providing an estimate of item performance for a group of MCCs. This strategy is considered to be beneficial to the judges' thinking process, and helpful in reducing the cognitive complexity in conceptualizing the MCC. Research reported that judges found conceptualizing a single, typical MCC to be easier and more realistic than imagining a group of hypothetical MCCs (Impara & Plake).

The item-mapping method also helps reduce the cognitive complexity in estimating item performance for the MCC in several ways. Considering a real and typical MCC simplifies the judgment of the MCC's item performance. Studies show that
teachers are more accurate in estimating the performance of individual students in their class than in estimating the performance of the total group of students (Hoge & Coladarci, 1989; Impara & Plake, 1997). Also, asking judges to consider a given probability (i.e., .50) for the MCC answering an item correctly (i.e., providing a yes or no answer) reduces the decision to a dichotomous process, and thus involves less cognitive processing than having the judges provide the probability estimates for the MCC answering the item correctly. Research in psychology indicates that dichotomous processing is a fundamental phenomenon of the human mind, and requiring a dichotomous response such as a true/false or a yes/no answer is the most effective and accurate way to collect responses from people (Stone, 1998).

The global picture of item difficulties (i.e., the item map) gives judges an intuitive sense of the relative standing of items in terms of item difficulty. This picture serves as an external benchmark in guiding judges' decision-making processes. Studies found that standard-setting results were more reproducible when item performance data were provided (Norcini, Shea, & Kanya, 1988; Swanson, Dillon, & Ross, 1990). The item-mapping method provides not only item performance data, but also the relative relationship of items in a simple histogram. Modern theory in cognitive science indicates that for human information processing, visual representation is less cognitively demanding than other modes of representation, such as symbolic and verbal (Simon, 1989). Therefore, the strategy of visually presenting item difficulties in a global item map simplifies the judging process in a standard setting, and helps judges provide more accurate and consistent estimates of the MCC's performance on particular items than the Angoff method.
The Rasch IRT Model and Use of the Probability Value

The Rasch IRT model provides the theoretical basis for the item-mapping standard-setting method. The basic premise underlying the Rasch IRT model is that the observed item response is governed by an unobservable candidate ability variable, θ, and the item difficulty. The probability of a candidate with an ability level of θ answering an item correctly can be modeled by the mathematical function (Hambleton & Swaminathan, 1985):

P_i(θ) = e^(θ - b_i) / (1 + e^(θ - b_i)),    i = 1, 2, ..., n,
where P_i(θ) is the probability that a randomly chosen candidate with ability θ answers item i correctly, and is an S-shaped curve with values between 0 and 1 over the proficiency scale; b_i is the difficulty of item i; n is the number of items in the test; and e is a transcendental number whose value is 2.71828.... When the assumptions underlying the Rasch model are tenable, item difficulties and candidate abilities can be estimated and placed on the same scale through Rasch scaling procedures. In contrast to the other IRT models, Rasch measurement gives a useful relationship between candidate ability and item difficulty: the distance between a candidate's ability and an item's difficulty determines the probability of a correct response to the item (Wright & Stone, 1979). The distinct feature of the Rasch model is that only one item parameter (i.e., item difficulty) determines the probability of success on the item. When candidate ability equals item difficulty, the probability of a correct answer to the item is .50. Anyone with an ability level lower than the item difficulty will have less than a .50 probability of success, and anyone with an ability level higher than the item difficulty will have greater than a .50 probability of answering the item correctly. By utilizing this distinguishing feature of the Rasch model, the item-mapping method enables judges to establish the passing score at the point where the MCC's ability level equals the item difficulty.

Questions often arise as to why the item-mapping method uses the response probability (RP) value of .50 as opposed to other RP values in setting a cut score, and whether the use of different RP values would systematically affect cut score placements. Other standard-setting methods that utilize IRT techniques in setting cut scores may use an RP value different than .50. For example, in the Bookmark standard-setting procedure, a .67 chance of mastery has been suggested (Lewis, Green, Mitzel, Baum, & Patz, 1998; Mitzel, Lewis, Patz, & Green, 2001). The use of the .50 RP value in the item-mapping method is based on a number of reasons. For the one-parameter IRT model (i.e., the Rasch model), in which no guessing factor is taken into consideration for candidates with low abilities, the item information is maximized when the probability of success is .50 (Hambleton & Swaminathan, 1985). This assures that the standard-setting process selects the cut point where the items can best discriminate between MCCs with abilities just below a .50 probability of answering the items correctly and those with abilities just above a .50 probability of answering the items correctly.
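The claim that item information is maximized at a .50 probability of success can be checked directly: for the Rasch model the information contributed by item i at ability θ is P_i(θ)[1 - P_i(θ)], which peaks at 0.25 when P_i(θ) = .50, that is, when θ = b_i. The short sketch below is a generic numerical check of this property, not code used in the reported standard settings.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def rasch_item_information(theta, b):
    """Item information for the Rasch model: P * (1 - P)."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

b = 0.0  # an item of average difficulty (hypothetical)
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p = rasch_prob(theta, b)
    info = rasch_item_information(theta, b)
    print(f"theta = {theta:+.1f}  P = {p:.3f}  information = {info:.3f}")
# Information is largest (0.25) at theta = b, where P = .50, and falls off
# symmetrically on either side.
```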
The RP value of .50 may be more appealing than other RP values because common sense suggests that the .50 point marks the dividing line between what MCCs "cannot do" and "can do" (Zwick, Senturk, Wang, & Loomis, 2001). The most important reason, however, for using the RP value of .50 in item mapping is its practical convenience. One practical advantage of the item-mapping method is its graphic presentation of item difficulties (i.e., the item map) in determining a minimal level of competency. In such a graphic representation, the relationship between candidate ability and item difficulty is critical in helping judges understand what an MCC knows in terms of the item difficulty scale. On the Rasch measurement scale, maximum utility from the item difficulty-candidate ability relationship is gained when the .50 RP value is used, because at this particular point candidate ability equals item difficulty and no mathematical adjustment is needed to understand the minimal competency point. If a different RP value such as .67 were used for item mapping, the difference between ability and item difficulty corresponding to the .67 RP value would be 0.693 (i.e., ln 2), not zero. Because of this difference between ability and item difficulty at the .67 RP value, mathematical adjustments would be needed for judges to identify where minimal competency is located in terms of the item difficulties shown on the item map. As with all IRT-based standard-setting methods, more research is needed to evaluate whether using alternative RP values would systematically affect cut score placements (Mitzel, Lewis, Patz, & Green, 2001).

As Kane (1994, 1998) indicated, a passing standard specifies the required level of mastery and conceptualizes minimal competency. The standard describes what candidates who meet the required level of performance should know. The cut score is the numerical result of the standard setting: it is a number on a particular score scale, and the passing standard represents the intended interpretation of that number in terms of what the test is intended to measure. Therefore, theoretically speaking, the use of alternative RP values should not have a systematic influence on cut score placements. Applying an alternative RP value will only add or subtract a constant from the true passing standard and will only represent a different score scale. In practice, however, this does not necessarily imply that a reliable and valid cut score would always be obtained by using an alternative RP value. Misunderstanding the RP criterion might introduce some bias into the judgment of a standard setting. For example, using the .67 RP criterion may result in a higher cut score than using the .50 RP criterion if judges conclude that the RP criterion of .67 represents a higher passing standard. As suggested by previous researchers (Mitzel et al., 2001), further research is recommended to determine whether the use of alternative RP values would affect cut score placement for all IRT-based standard-setting methods. In addition, carefully designed studies are needed to examine whether the best use of the relationship between candidate ability and item difficulty (i.e., applying the .50 RP criterion) in item mapping would enhance the validity and reliability of a cut score placement.
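The mathematical adjustment referred to above is simply the logit of the chosen RP value: under the Rasch model, a candidate must sit ln[RP/(1 - RP)] logits above an item's difficulty to have that probability of success, which is 0 for RP = .50 and ln 2 ≈ 0.693 for RP = 2/3. A minimal sketch of that offset, shown in logits and on the 10-points-per-logit scaled metric of the item map (the RP values chosen here are illustrative):

```python
import math

def ability_offset_logits(rp):
    """Logits by which candidate ability must exceed item difficulty
    to reach the given response probability (RP) under the Rasch model."""
    return math.log(rp / (1.0 - rp))

for rp in (0.50, 0.60, 2.0 / 3.0, 0.80):
    offset = ability_offset_logits(rp)
    print(f"RP = {rp:.2f}: offset = {offset:.3f} logits "
          f"({offset * 10:.1f} points on the scaled item map)")
# RP = .50 gives an offset of 0, so minimal competency coincides with the
# selected difficulty column; any other RP value shifts the implied ability
# away from the column by the amount shown.
```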
The Bookmark and the Item-Mapping Methods

A well-known IRT-based standard-setting method, the Bookmark method, has also been developed to remedy the cognitive complexity of the Angoff method and
has been applied to educational assessments (Lewis et al., 1998; Mitzel et al., 2001). Both the Bookmark and the item-mapping methods attempt to provide less cognitively complex approaches for judges to determine a cut score, as compared to the Angoff method, by integrating IRT methodology into the standard-setting process (i.e., providing judges with structured item difficulty information). Another similarity between the Bookmark and the item-mapping methods is that judges are not required to provide percentage estimates of item mastery for the MCCs, which is considered a relatively difficult cognitive task.

Even though both methods are IRT-based, a number of differences between the Bookmark and the item-mapping methods are notable, given that the two methods were developed independently to serve different purposes: Bookmark was designed to fulfill the needs of educational assessments (Mitzel et al., 2001), and item mapping was developed primarily for licensure and certification examinations. A notable difference between the two methods is the way that the IRT information is utilized to guide judges' decision making. The Bookmark method evolved from an "IRT-Modified Angoff Procedure" (Lewis et al., 1998), which integrates IRT methodology into the Angoff item-rating process. By rank ordering item difficulties scaled from IRT (i.e., ordered item booklets), and by an iterative rating process (normally three rounds), the Bookmark method provides judges with structured item information to guide decision making (Lewis et al.; Mitzel et al.). The Bookmark method uses the item difficulty information to structure (i.e., order) items in such a way that the judgment task can be constrained to determining which content should be mastered, and which content need not be mastered, by the MCCs (Mitzel et al.). In contrast, the item-mapping method not only provides judges with structured item information but also presents the information in a simple graphic histogram. This histogram gives a global view of the items' relative standing in an intuitive and visual manner. The item-mapping method uses the item difficulty information in such a way that the item difficulty-candidate ability relationship can be maximally utilized in determining the ability level representing minimal competency in terms of item difficulty.

Other differences are involved in the standard-setting process. As stated by its developers, the Bookmark method evolved from an "IRT-Modified Angoff Procedure" (Lewis et al., 1998), and it still requires judges to conduct item-by-item ratings as in the Angoff method. Also, an iterative rating process, normally three rounds, is needed in the Bookmark method. In item mapping, only a representative sample of items (representing both item-difficulty levels and test-content areas) drawn from the item map, not every single item, needs to be rated. Therefore, the item-mapping standard-setting process is less time consuming and reduces rater fatigue and the cost of a standard-setting study, as compared to the Angoff and Bookmark standard-setting processes. Experience indicates that for an item pool consisting of 300 items, at least 2 days are needed to rate all items using the Angoff method, whereas only 1 day is typically needed to set a cut score using the item-mapping method.

As it was designed, the Bookmark method was developed to fulfill the needs of educational settings where multiple levels of mastery need to be established to measure
educational progress, and multiple types of test items (e.g., multiple-choice and essay questions) are used in the assessments (Lewis et al., 1998). The item-mapping method was developed for large-scale multiple-choice licensure and certification examination programs, where a single passing score is typically needed to assess competency. Given that multiple equivalent test forms are needed for such examinations to ensure test security, a large number of test items is used in setting a cut score, and item mapping is therefore useful from time-saving and cost-efficiency perspectives. Since item mapping reduces the time required for cut score studies, the method is beneficial when hundreds of items are used for a single cut score study. The Bookmark and the item-mapping methods are likely interchangeable in both educational and licensure and certification settings, although application of the Bookmark method in the licensure and certification fields may be cost and time prohibitive. Further research is needed to investigate what differences, if any, may exist in applying the two methods across both educational and licensure and certification environments.
Four Standard-Setting Studies

Data Sources
Standard-setting studies were performed on four diverse multiple-choice professional licensure or certification examinations, using both the item-mapping and the Angoff methods. The first study was conducted in July 1999 for the ethics examination of a state insurance licensing program. The second was conducted in August 1999 for the property examination of a state insurance licensing program. The third was conducted in August 1999 for the casualty examination of a state insurance licensing program. The fourth was conducted in December 2000 for a national nurse aide certification examination. Thirteen judges participated in the first cut score study, where 43 items were rated by the same judges using both the Angoff and item-mapping methods. Six judges participated in the second and third cut score studies, where 73 items for the second exam and 79 items for the third exam were rated by the same judges using both the item-mapping and Angoff methods. The second and the third exams are two different exams for the same state insurance licensing program; therefore, the same judges participated in both cut score studies in a consecutive time period. The second and third studies were conducted for a different state than the first study and, naturally, none of the judges used in the second and the third studies participated in the first study. For the fourth standard-setting study, eight judges (again different from the judges in studies one through three) rated 53 items using both the Angoff and item-mapping methods. All judges were subject-matter experts (SMEs) in their professions. For the three insurance standard settings, judges were managing general insurance agents, insurance agency owners, educators, and instructors. For the national nurse aide cut score study, the SMEs were registered nurses and licensed nurse practitioners. In addition to their qualifications, all SMEs were selected based on two criteria: geographical representation of the test-taker population and representation of the profession.
To minimize the order effect of the standard-setting methods, the Angoff method was used first in the first and the fourth studies, while the item-mapping method was used first in the second and the third studies.
Scaling and Goodness of Fit

Test items used in each exam were first reviewed and approved by a committee of SMEs for pretesting in the field. After pretesting, the Rasch IRT model was used to calibrate the items using BIGSTEPS (Linacre & Wright, 1995). Before an item was selected for inclusion in a standard-setting study, two commonly used goodness-of-fit statistics in the Rasch model, weighted total fit and unweighted total fit, were examined to assess model-data fit (Linacre & Wright; Wright & Masters, 1982; Wright & Stone, 1979). Both the weighted- and unweighted-fit statistics essentially calculate the difference between the observed frequency of candidates answering an item correctly and the expected frequency of candidates answering the item correctly, where the expected frequency is predicted from the Rasch IRT model. Each of the unweighted and weighted fit statistics has an expected value of 0 and a standard deviation of 1 (i.e., the asymptotic sampling distribution of the statistic is the standard normal distribution N[0,1]) (Smith, 1991). In generating the item map for each item-mapping study, any item with an absolute value greater than 2 on either of the fit statistics was considered a misfit item, was excluded from the standard-setting study, and would not be used in an examination. Further research could investigate whether a different cut would be established if misfitting items were included in the standard setting. To make a consistent comparison, those misfit items were also excluded from the corresponding Angoff standard setting.
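As a rough illustration of the screening rule just described, the sketch below retains only items whose standardized weighted (infit) and unweighted (outfit) fit statistics are both within ±2. The item records and field names are hypothetical; in practice the fit statistics would come from the BIGSTEPS (or a comparable Rasch) calibration output rather than being typed in by hand.

```python
# Hypothetical calibration output: one record per item with standardized
# weighted (infit) and unweighted (outfit) total-fit statistics.
calibrated_items = [
    {"item_id": "A101", "scaled_difficulty": 96, "infit_z": 0.4, "outfit_z": -0.2},
    {"item_id": "A102", "scaled_difficulty": 104, "infit_z": 2.6, "outfit_z": 1.1},
    {"item_id": "A103", "scaled_difficulty": 88, "infit_z": -0.9, "outfit_z": -2.3},
    {"item_id": "A104", "scaled_difficulty": 111, "infit_z": 1.3, "outfit_z": 0.8},
]

FIT_CUTOFF = 2.0  # both statistics are approximately N(0, 1)

def fits_rasch_model(item):
    """Keep an item only if both standardized fit statistics are within +/- 2."""
    return abs(item["infit_z"]) <= FIT_CUTOFF and abs(item["outfit_z"]) <= FIT_CUTOFF

retained = [item for item in calibrated_items if fits_rasch_model(item)]
misfits = [item["item_id"] for item in calibrated_items if not fits_rasch_model(item)]

print("items retained for the item map:", [item["item_id"] for item in retained])
print("misfitting items excluded from both methods:", misfits)
```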
Procedures

Each of the cut score studies was conducted by an experienced test developer and/or psychometrician using both methods. All of the standard-setting studies began with an overview of the examination program and an introduction to standard setting in general. Then, the characteristics of MCCs for the licensure or certification were identified from an extensive discussion among the judges. Minimal competency was defined as the minimal level of knowledge and skills required for competent performance at an entry-level position (i.e., a person with just sufficient knowledge and skills to receive the license or certificate). Rather than just having judges acquire the terminology, the goal of defining the characteristics of MCCs was to stimulate a discussion about the variety of tasks that an entry-level practitioner needs to perform on the job, and about the necessary knowledge, skills, and abilities that are required for public protection. As an example, the Appendix provides the outlined characteristics of MCCs identified by the judges at the standard setting for the National Nurse Aide Certification Examination. Following the discussion of MCCs, depending on which method was going to be used, either the Angoff or the item-mapping method was presented in detail to the judges, with the focus of discussion on questions and answers. Afterward, judges were instructed to begin their ratings using the method presented. Once a cut score was established using the first method, judges were introduced to the second standard-setting method and
were given instructions for the use of this method. Again, questions and answers regarding the standard-setting method were discussed to assure a full understanding of the procedure.

For the Angoff method, prior to proceeding with actual ratings, judges were given an opportunity to practice on a sample of test items. In using the Angoff method, for a given test item, judges were asked the question, "What percentage of MCCs will answer this item correctly?" After conducting their initial ratings independently, judges announced their individual item ratings to the entire panel of judges. Following that, item statistics such as p-values and distractor analyses were shown to the judges. Based on the item statistics and a discussion of their initial ratings with the group, judges were given the opportunity to change their ratings. Each judge's final percentage rating on every item was recorded. These percentage ratings were aggregated across items and averaged across judges to yield the cut score for each examination.

When using the item-mapping method, judges were first asked to consider a real and typical MCC in their training program or class. Then, the item map was presented to the judges and individual items were selected one at a time from the map. Judges were asked to independently determine whether the typical MCC has at least a .50 chance of answering the item correctly (i.e., provide a yes or no answer), and then to announce their ratings to the group. If consensus among the judges was not reached, the standard-setting facilitator provided judges with item statistics such as p-values and distractor analyses to foster a consensus by stimulating a discussion about the rationale for their initial ratings. Afterward, judges were given the opportunity to change their ratings. Every judge's individual final rating on each item was recorded as a value of "1" if the judge agreed that an MCC had at least a .50 chance of answering the item correctly; otherwise, a "0" was recorded for the judge's item rating. Usually several items, but not all items, from each column representing the same level of scaled item difficulty were selected to determine whether judges would reach consensus for the items. If there was consensus, the facilitator selected more difficult or easier items from other columns and repeated the process. It should be noted that consensus has two meanings here: consensus among judges on each individual item, as well as consensus that the MCC had at least a .50 chance of answering successfully across the chosen items from the same column. If there was no consensus in either of the two senses, more items from the same column were selected to rate, and more items from the surrounding columns (one column to the left or right) were also selected to rate. The process was repeated until a column of items was located where most items in the column had consensus from most of the judges that the probability of a correct answer for the MCC was .50. The corresponding middle level of item difficulty under this column was established as the cut score.

At each item-mapping study, a complete picture of the MCC's ability in terms of the probabilities of correctly answering any item on the item map was also generated after the cut score was determined. With the probability ruler (see Figure 2) aligned so that the .50 marker was centered over the column selected to be the cut, the probabilities of a correct response for all items for the MCC were identified.
This complete picture of the MCC's ability was very helpful in determining whether an initial cut score met judges' conceptual understanding of minimal competency and whether more iterations of item ratings were needed to adjust the initial cut score during the standard-setting process. A histogram representing the distribution of candidate performance from the entire group of test takers was also presented to predict the passing percentage, given the cut score established at each item-mapping study. This gave judges feedback on the reasonableness of the proposed cut score, given the distribution of candidate abilities. After each item-mapping standard setting, judges were asked to evaluate the cut score and determine whether it met their conceptual understanding of minimal competency. In addition, judges had the opportunity to express whether the predicted pass rate from the item-mapping study met their expectation of the candidate population. If the Angoff method was used before the item mapping, judges were also asked to compare the cut scores from the two methods to determine which method resulted in a more reasonable cut.
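To illustrate how the two sets of final ratings might be aggregated, the sketch below contrasts the Angoff computation (percentage ratings summed across items and averaged across judges) with the 1/0 recording used in item mapping (agreement that the MCC has at least a .50 chance on the rated item). All rating values are invented for illustration.

```python
# --- Angoff: each judge gives a percentage estimate per item. ---
# rows = judges, columns = items (hypothetical values)
angoff_ratings = [
    [70, 85, 60, 90],
    [65, 80, 55, 95],
    [75, 90, 65, 85],
]

# Each judge's implied raw cut is the sum of his or her item probabilities;
# the panel's recommended raw cut is the average of those sums.
judge_cuts = [sum(row) / 100.0 for row in angoff_ratings]
angoff_raw_cut = sum(judge_cuts) / len(judge_cuts)
print(f"Angoff raw cut score: {angoff_raw_cut:.2f} of {len(angoff_ratings[0])} items")

# --- Item mapping: each judge answers yes (1) / no (0) per rated item:
# "Does the typical MCC have at least a .50 chance on this item?" ---
item_map_ratings = {
    "item_17": [1, 1, 1],   # consensus: yes
    "item_42": [1, 0, 1],   # majority yes
    "item_63": [0, 0, 0],   # consensus: no
}
for item_id, votes in item_map_ratings.items():
    agreement = sum(votes) / len(votes)
    print(f"{item_id}: proportion of judges answering yes = {agreement:.2f}")
```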
Data Analysis

Rating data from both the Angoff and item-mapping methods for the four standard settings were analyzed using methods from generalizability theory (G theory). G theory assumes that a behavioral measurement is a sample from a universe of admissible observations (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). In G theory, the degree to which the sample measurement can be generalized to the universe is estimated using generalizability (G) and decision (D) studies. The purpose of a G study is to identify and estimate the magnitude of the sources of variance in the measurement. A D study makes use of the information provided by the G study to evaluate the effectiveness of alternative designs for minimizing error and maximizing reliability (Brennan, 1992; Shavelson & Webb, 1991). To answer the question of which method yielded greater inter-judge consistency, a rater-by-item (r × i) random-effects G study was first conducted for each of the eight data sets (4 exams × 2 methods). The magnitudes of variances due to sampling judges and items were compared for both methods across the four standard settings. Then, to evaluate rater agreement given the measurement condition (i.e., the combination of the number of raters and the number of items for each standard setting), the variance components estimated from the G studies were used to calculate dependability coefficients (Φ) for the eight D studies (4 exams × 2 methods) (Brennan & Lockwood, 1980). The GENOVA computer program (Crick & Brennan, 1983) was used in the G and D studies. Cut scores for each standard setting from each method were also calculated using three different score scales: raw score (the number of correct answers), percentage score (percent of correct answers), and the scaled level of item difficulty used on the item map.
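For readers who want to see the mechanics behind the G and D studies, the sketch below estimates the variance components for a fully crossed rater-by-item design from the usual expected-mean-square equations and then computes the dependability coefficient with items as the object of measurement. It is a minimal illustration with a hypothetical rating matrix; the analyses reported in this article were carried out with the GENOVA program.

```python
# A minimal sketch of a rater-by-item (r x i) random-effects G study and the
# dependability coefficient used as the rater-agreement index.

def g_study(ratings):
    """ratings[i][r] = rating of item i by rater r (fully crossed, one observation per cell)."""
    n_i, n_r = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_i * n_r)
    item_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(ratings[i][r] for i in range(n_i)) / n_i for r in range(n_r)]

    ms_item = n_r * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
    ms_rater = n_i * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_res = sum(
        (ratings[i][r] - item_means[i] - rater_means[r] + grand) ** 2
        for i in range(n_i) for r in range(n_r)
    )
    ms_res = ss_res / ((n_i - 1) * (n_r - 1))

    # Expected-mean-square solutions for the two-way crossed random design
    # (negative estimates, if any, are conventionally set to zero).
    var_res = ms_res                      # item x rater interaction confounded with error
    var_item = (ms_item - ms_res) / n_r
    var_rater = (ms_rater - ms_res) / n_i
    return var_item, var_rater, var_res, n_r

def dependability(var_item, var_rater, var_res, n_r):
    """Phi with items as the object of measurement and n_r raters."""
    return var_item / (var_item + var_rater / n_r + var_res / n_r)

# Hypothetical Angoff-style percentage ratings: 4 items x 3 judges.
ratings = [
    [70, 65, 75],
    [85, 80, 90],
    [60, 55, 65],
    [90, 95, 85],
]
components = g_study(ratings)
print("variance components (item, rater, residual):", components[:3])
print("dependability coefficient:", round(dependability(*components), 3))
```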
Results and Discussion

Table 1 provides estimates of the variance components from rater-by-item (r × i) random-effects G studies and dependability coefficients (Φ) from D studies for the
Angoff rating data on each of the four exams. Table 2 provides the same information for the item-mapping rating data. The percentage distributions of variance for items, raters, and the rater-item interaction for each exam are also displayed in these tables. Results indicate that for each standard-setting method, the distributions of variation due to each factor of the design are similar across the four exams, but different distribution patterns are found for the two methods. For the item-mapping method, the variance due to items accounts for approximately 80% of the total variance across the four exams, whereas it ranges only from 40% to 50% for the Angoff method. While the variability due to raters accounts for a small amount of the total variance in both methods, a relatively larger percentage was found for the Angoff method (1.4% to 5%) than for the item-mapping method (0% to 0.2%). Relatively larger proportions of variance are found for the interaction between rater and item for the Angoff method (45% to 56%) than for the item-mapping method (16% to 21%). Given the single-faceted design of the analyses, this variance component is confounded with both measurement error and differential rating of items across raters. For the item-mapping method, these results reveal that the variation due to item sampling dominates the total variance, while the variation due to sampling raters accounts for a negligible amount of the total variation. For the Angoff method, the variation due to measurement error and differential rating of items across raters accounts for a large proportion of the total variance. The variation due to rater sampling accounts for a small amount of the total variance for the Angoff method, but it is still larger than that found for the item-mapping method. These results indicate that judges provide more consistent ratings in the item-mapping than in the Angoff method.

The dependability coefficients (Φ) estimated from the D studies provide additional information on inter-judge consistency. Each dependability coefficient was computed using the variance components estimated from the G study and the same sample size (i.e., the number of raters and items) that was used in the standard setting. The GENOVA program provides estimates of the dependability coefficients (Φ) using the equation
Φ = σ²_i / (σ²_i + σ²_j/n_j + σ²_ij,e/n_j),
where σ²_i is the estimated variance for items, σ²_j is the estimated variance for judges, σ²_ij,e is the estimated variance for the interaction between judges and items confounded with measurement error, and n_j is the number of judges (Brennan, 1992). The dependability coefficient yields information on the average consensus among judges in their item ratings and serves as an index of rater agreement (Brennan, 1992; Brennan & Lockwood, 1980). Results reveal that across the four standard settings, judges reach higher agreement in the item-mapping than in the Angoff method. All ratings from the item-mapping method reach rater agreement higher than 0.95, whereas the rater agreements for the Angoff method range from 0.796 to 0.922. The degree of difference in rater agreement between the two methods varies from exam to exam. The third exam shows the largest difference (0.796 versus 0.958), and the first exam has the closest rater agreement (0.922 versus 0.985).
The rater agreement index provides results similar to the findings from the estimates of variance components: the judges provide more consistent ratings using the item-mapping method than using the Angoff method. Since the Angoff method was used first for two exams and the item-mapping method was used first for the other two exams, one concern of this study was whether the implementation order would affect the results. Although item mapping resulted in higher rater agreement than the Angoff method across the four exams, it appears that the presentation order may also contribute to the discrepancy in dependability coefficients. Tables 1 and 2 show that the dependability coefficients (Φ) for the two methods are relatively closer when the Angoff method was presented first (the difference is 0.063 for the first exam and 0.078 for the fourth exam), and more discrepant when the item-mapping method was presented first (the difference is 0.108 for the second exam and 0.162 for the third exam). To provide a better design for comparing the two methods, two separate groups of judges, each using one presentation order to set a cut score for the same exam, should be considered in future research.
TABLE 1
Estimates of variance components for item x judge G studies and dependability coefficients from D studies (Angoff)

Exam   #Items   #Raters   Item (SE)       %       Judge (SE)    %      Item x Judge (SE)   %       Φ
1st    43       13        93.34 (21.30)   47.6%   2.72 (0.31)   1.4%   99.85 (1.03)        51.0%   0.922
2nd    73       6         86.42 (16.42)   49.6%   7.81 (0.79)   4.5%   79.84 (0.99)        45.9%   0.856
3rd    79       6         30.63 (6.00)    39.5%   3.78 (0.39)   4.9%   43.20 (0.51)        55.7%   0.796
4th    53       8         17.83 (3.82)    50.2%   1.74 (0.12)   4.9%   15.98 (0.15)        44.9%   0.890

Note: A number in parentheses represents the standard error associated with the estimate.
TABLE 2
Estimates of variance components for item x judge G studies and dependability coefficients from D studies (item map)

Exam   #Items   #Raters   Item (SE)      %       Judge (SE)     %      Item x Judge (SE)   %       Φ
1st    43       13        0.210 (0.04)   83.5%   0.0003 (0.0)   0.1%   0.041 (0.0)         16.3%   0.985
2nd    73       6         0.167 (0.03)   81.9%   0.0003 (0.0)   0.2%   0.037 (0.0)         17.9%   0.964
3rd    79       6         0.158 (0.03)   79.1%   0.0003 (0.0)   0.2%   0.041 (0.0)         20.7%   0.958
4th    53       8         0.194 (0.04)   79.0%   0.0 (0.0)      0.0%   0.051 (0.0)         21.0%   0.968

Note: A number in parentheses represents the standard error associated with the estimate.
Table 3 provides the cut score established by each method for each examination, as well as the predicted passing rate given the established cut score. The cut scores are reported on three scales: the number of items (raw cut score), the number of items expressed as a percentage of the total items, and the converted item difficulty scale used on the item map. It is noted that the item-mapping method sets a lower cut score than the Angoff method. This is a consistent result for all four examinations. The magnitudes of the cut score differences vary from exam to exam, ranging from a difference of 9% to 19% in terms of the percentage of total items. Corresponding to the cut scores set by each method, the predicted percentages of passing are lower for the Angoff method than for the item-mapping method across the four exams. The pass rate differences between the two methods vary from exam to exam, ranging from 22.6% to 51.3%.

Experience in using the Angoff method for licensure and certification exams shows that judges normally give item ratings between 70% and 80%; for easier items, ratings may rise to 90% or 95%, and for harder items, ratings may decrease to 50% or 60%. It rarely occurs that item ratings go below 50%. These rating ranges occurred not only in the standard settings reported in this study, but also in all other standard settings conducted by the authors for licensure and certification examinations. As a result, Angoff cut scores are typically set between 70% and 80%, even for exams containing many difficult items. However, when the item-mapping method is used, judges do rate difficult items more severely (i.e., an MCC has less than a .50 chance of answering correctly). Once a cut score is determined and the picture of the MCC's ability is portrayed using the probability ruler shown in Figure 2, judges are comfortable with their relatively low ratings and believe that the final cut meets their conceptual understanding of minimal competency. Similar to the authors' experience, overestimates of item performance for target candidate populations were also found in studies using the Angoff or other test-centered standard-setting methods that require judges to conduct a similar task of rating item by item (Goodwin, 1999; Plake, 1998). A researcher (Klein, 1984) also discovered that judges using the Angoff method based their performance estimates on average or above-average candidates, rather than on the MCC, resulting in ratings that were unrealistically high.
TABLE 3
Cut scores and the scaled logit cuts established by each standard-setting method

Study                        Method     Raw Cut Score (%)*   Scaled Logit Cut   Predicted Pass Rate
1st (13 judges, 43 items)    Angoff     29 (68%)             111                28.1%
                             Item Map   22 (51%)             101                79.4%
2nd (6 judges, 73 items)     Angoff     49 (67%)             109                59.2%
                             Item Map   36 (49%)             101                91.8%
3rd (6 judges, 79 items)     Angoff     53 (67%)             110                62.4%
                             Item Map   38 (48%)             101                92.3%
4th (8 judges, 53 items)     Angoff     43 (81%)             118                72.0%
                             Item Map   38 (72%)             108                94.6%

* A number in parentheses represents the raw cut score as a percentage of the total number of items.
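One way to see how a scaled logit cut translates into a raw cut score and a projected pass rate is through the Rasch test characteristic function: the expected number-correct score at the cut ability is the sum of the item probabilities at that ability, and the projected pass rate is the proportion of calibrated candidate abilities at or above the cut. The sketch below illustrates that logic with invented item difficulties and candidate abilities; it is not a reconstruction of the values in Table 3.

```python
import math

def prob_correct(theta_scaled, b_scaled):
    """Rasch probability on the scaled metric (10 points per logit)."""
    return 1.0 / (1.0 + math.exp(-(theta_scaled - b_scaled) / 10.0))

def expected_raw_score(theta_scaled, item_difficulties):
    """Test characteristic function: expected number-correct score at theta."""
    return sum(prob_correct(theta_scaled, b) for b in item_difficulties)

def projected_pass_rate(cut_scaled, candidate_abilities):
    """Proportion of candidates whose estimated ability meets or exceeds the cut."""
    passing = sum(1 for theta in candidate_abilities if theta >= cut_scaled)
    return passing / len(candidate_abilities)

# Hypothetical calibrated values, all on the scaled (10 points per logit) metric.
item_difficulties = [86, 92, 95, 99, 101, 104, 108, 113, 118, 124]
candidate_abilities = [90, 95, 98, 102, 103, 107, 109, 112, 116, 121]

cut = 101  # a scaled logit cut chosen by the panel (hypothetical)
print(f"expected raw cut score: {expected_raw_score(cut, item_difficulties):.1f} "
      f"of {len(item_difficulties)} items")
print(f"projected pass rate: {projected_pass_rate(cut, candidate_abilities):.0%}")
```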
Following the determination of the cut score in each item-mapping standard setting in this study, judges as a group were asked to discuss two questions: whether the cut score met their conceptual understanding of minimal competency and whether the predicted pass rate met their expectation of the test takers. For all four exams, the majority of judges believed that the cut score met their conceptual understanding of the passing standard, given the portrait of the MCC's ability shown by the probability ruler. Also, most of the judges were satisfied with the predicted pass rate. For the two exams where the Angoff method was used first, judges were also asked to have a group discussion comparing the cut scores set by the two methods and to discuss whether they were comfortable using either one of the methods to set a cut score. Judges agreed that the item-mapping method set more realistic cut scores than the Angoff method. In terms of comparing the standard-setting methods, some judges indicated that they were not very clear about what the process would entail at the beginning of the item-mapping process, but after a few items were rated, they became more comfortable with the scope of the process. This comment suggests that when using the item-mapping method, both a full explanation and practice are needed at the beginning of the study. All judges expressed their appreciation for the reduced time required by the item-mapping method in comparison with the Angoff method. To provide more data regarding judges' perspectives on comparing standard-setting methods, a formally developed questionnaire should be used in future studies.
Conclusions and Directions for Future Research

Licensure and certification testing provides a dual service. For practitioners, it can provide employment opportunities. For the public, it ensures a certain level of safety. Neither service can be provided adequately without setting a justifiable and valid passing score. In current practice, the Angoff method is the most commonly used standard-setting method (Kane, 1995; Plake, 1998). Given the criticism of cognitive difficulty and inaccurate item performance ratings in the Angoff method, this article advocates the use of the item-mapping method for setting cut scores. Both logical and empirical evidence is provided to give legitimacy and support to the item-mapping method.

Problems typically confronting judges in test-centered standard-setting methods are conceptualizing a group of MCCs and accurately estimating item performance for the group of MCCs (Impara & Plake, 1997). The item-mapping method offers resolutions for these difficulties, all related to reducing the cognitive complexity of the judges' decision-making processes. Asking judges to think of a real and typical MCC in a training program or class helps judges conceptualize the MCC, and simplifies the judges' thinking processes in determining the probability of an MCC answering an item correctly. Providing only a yes or no answer to the question "Does the MCC have at least a .50 chance of answering an item correctly?" in item mapping is an easier task than estimating an absolute probability of a correct response in the Angoff method. The most important and unique feature of the item-mapping method is that a global picture of item difficulties is presented to judges during the standard-setting process. Visually presenting item difficulties on a global item map is effective at alleviating the problem of judges being unable to provide accurate and consistent estimates of the MCC's item performance.

The item-mapping method uses the Rasch IRT model as a basis for standard setting. Through application of the Rasch IRT model, item difficulties are estimated
and placed on the same scale as candidate abilities. The passing score is established by identifying the difficulty/ability point on the item difficulty histogram chart where MCCs have a .50 chance of answering the items correctly. The .50 chance of success is chosen as the probability criterion in item mapping for a number of reasons. For the Rasch model, the item information is maximized at the point where the probability of success is .50. In addition, the .50 probability value corresponds to the unique point on the Rasch measurement scale where candidate ability equals item difficulty. Since item mapping aims to determine the ability level representing minimal competency (i.e., the cut score) in terms of item difficulties, the .50 probability criterion provides the best point for judges to utilize the item difficulty-candidate ability relationship and to understand the minimal competency point on the item map. However, as with other IRT-based standard-setting methods, such as the Bookmark procedure, future research is needed to examine whether the use of alternative RP values would systematically affect a cut score placement. In particular, future research is still needed to provide empirical evidence on whether the use of the .50 RP criterion in item mapping would, in practice, enhance the validity and reliability of a cut score placement, as compared to the use of other RP values.

By analyzing data from four standard-setting studies for a diverse set of professional licensure and certification examinations, this study provides empirical evidence of better decision consistency and more justifiable results from the item-mapping method than from the Angoff method. By examining the percentage distributions of the variance components from rater-by-item G studies and the dependability coefficients from rater-by-item D studies, this study finds that the item-mapping method allows for higher inter-judge consistency than does the Angoff method. Another finding of this study is that the item-mapping method sets consistently lower cut scores than the Angoff method. Once a cut score is determined and the picture of the MCC's ability is portrayed, judges are comfortable with their relatively low ratings from the item mapping, because the final cut meets their conceptual understanding of minimal competency. This is consistent with other research findings (Goodwin, 1999; Klein, 1984; Plake, 1998) showing that unrealistically high ratings may be obtained from the Angoff method or from other test-centered standard-setting methods that require judges to rate item by item.

Researchers have recommended the use of an alternative approach to the Angoff method, which is a better choice than the traditional approach (Impara & Plake, 1997). Unlike the traditional Angoff method, the alternative approach asks judges simply to indicate whether an MCC will be able to answer each item correctly (the yes/no method). The judgmental thinking used in this method is closer to the item-mapping method than is the traditional approach of the Angoff method. More research should be conducted to compare this yes/no approach of the Angoff method with the item-mapping method to better examine the enhancement of item mapping over the Angoff method. In addition, in comparing the methods, the Angoff method was used first for two exams and the item-mapping method was used first for the other two exams in this study. Whether the implementation order would contribute to the discrepancy in rater agreement between the two methods is
still of concern. It is also suggested that future research be conducted with two separate groups of judges, each using one implementation order to set a cut score for the same exam. To provide more validity evidence with respect to judges' perceptions of the two standard-setting methods, a well-designed questionnaire should be used in future studies. The use of the item-mapping method requires prior statistical analysis of test data: all item difficulties and candidate abilities have to be estimated using the Rasch IRT model before the standard-setting study. Future research could also investigate whether a different cut would be set if items not fitting the Rasch model were included in the standard setting.
Appendix: Outlined Characteristics of Minimally Competent Nurse Aides

Fourth to fifth grade reading and math levels.
Meet minimum training program standards for satisfactory completion.
Basic knowledge of resident's rights.
Understanding of the minimum guidelines outlined in the federal regulations.
Basic understanding of the terminology used in health care.
Understanding of basic infection control standards.
Knowledge of personal health and hygiene.
Basic problem-solving techniques.
Ability to cope with the physical and mental demands of the job.
Knowledge of basic normal and abnormal body functions.
Basic knowledge of the principles of safety and emergencies.
Notes

The author would like to thank Betty A. Bergstrom, Randall F. Wiser, Barbara G. Dodd, and two anonymous reviewers for their valuable suggestions for revising the article. I would also like to acknowledge that my colleague, Larry Newman (deceased in February 2001), was instrumental in developing and utilizing the item-mapping method in licensure and certification standard-setting studies. An earlier version of this article was presented at the 2001 annual meeting of the National Council on Measurement in Education, Seattle, WA.

In this article, the term "item mapping" refers to a standard-setting method that utilizes item maps to establish cut scores for licensure and certification exams. Another popular use of the term "item mapping" is in large-scale educational assessment programs (e.g., NAEP) for "the use of exemplar items to characterize particular score points" (Zwick, Senturk, Wang, & Loomis, 2001, p. 16).

The description of the item-mapping method in this section is limited to procedures that are specific to the item-mapping method. General implementation procedures that are required in a standard-setting process, such as reviewing test purposes and discussing characteristics of minimally competent candidates, also apply to the item-mapping method.

On an item map, each item is labeled by its ID number and is color- and symbol-coded by its associated test content area. During an item-mapping standard setting, items are not only selected to represent various item difficulty levels, but a sample of items is also selected from each content area to ensure that the rated sample of items is representative of both item difficulty levels and test content areas.
References

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.
Bejar, I. I. (1983). Subject matter experts' assessment of item statistics. Applied Psychological Measurement, 7, 303-310.
Berk, R. A. (1996). Standard setting: The next generation. Applied Measurement in Education, 9(3), 215-235.
Brennan, R. L. (1992). Elements of generalizability theory. Iowa City, IA: ACT.
Brennan, R. L., & Lockwood, R. E. (1980). A comparison of the Nedelsky and Angoff cutting score procedures using generalizability theory. Applied Psychological Measurement, 4, 219-240.
Chang, L. (1999). Judgmental item analysis of the Nedelsky and Angoff standard-setting methods. Applied Measurement in Education, 12(2), 151-166.
Cizek, G. J. (1996). Standard-setting guidelines. Educational Measurement: Issues and Practice, 15(1), 13-21.
Crick, J. E., & Brennan, R. L. (1983). A GENeralized analysis Of VAriance system [GENOVA]. Iowa City, IA: ACT.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.
Goodwin, L. D. (1999). Relations between observed item difficulty levels and Angoff minimum passing levels for a group of borderline candidates. Applied Measurement in Education, 12(1), 13-28.
Grosse, M. E., & Wright, B. D. (1986). Setting, evaluating, and maintaining certification standards with the Rasch model. Evaluation and the Health Professions, 9, 267-285.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory. Boston: Kluwer-Nijhoff.
Hoge, R. D., & Coladarci, T. (1989). Teacher-based judgments of academic achievement: A review of the literature. Review of Educational Research, 59(3), 297-313.
Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational Measurement, 34(4), 353-366.
Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69-81.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.
Kane, M. (1995). Examinee-centered vs. task-centered standard setting. Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments. Washington, DC.
Kane, M. T. (1998). Criterion bias in examinee-centered standard setting: Some thought experiments. Educational Measurement: Issues and Practice, 17(1), 23-30.
Klein, L. W. (1984, April). Practical considerations in the design of standard setting studies in health occupations. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Lewis, D. M., Green, D. R., Mitzel, H. C., Baum, K., & Patz, R. J. (1998, April). The bookmark standard setting procedure: Methodology and recent implementations. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Linacre, J. M., & Wright, B. D. (1995). BIGSTEPS computer program. Chicago: MESA Press.
McKinley, D. W., Newman, L. S., & Wiser, R. F. (1996, April). Using the Rasch model in the standard-setting process. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Mills, C. N. (1995). Establishing passing standards. In J. C. Impara (Ed.), Licensure testing: Purpose, procedures, and practices (pp. 219-252). Lincoln, NE: Buros Institute of Mental Measurements.
Mills, C. N., & Melican, G. J. (1988). Estimating and adjusting cutoff scores: Future of selected methods. Applied Measurement in Education, 1, 261-275.
Mills, C. N., Melican, G. J., & Ahluwalia, N. T. (1991). Defining minimal competence. Educational Measurement: Issues and Practice, 10(2), 7-10.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The bookmark method: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ: Erlbaum.
National Academy of Education. (1993). Setting performance standards for student achievement. Stanford, CA: Author.
National Research Council. (1999). Setting reasonable and useful performance standards. In J. W. Pellegrino, L. R. Jones, & K. J. Mitchell (Eds.), Grading the nation's report card: Evaluating NAEP and transforming the assessment of educational progress (pp. 162-184). Washington, DC: National Academy Press.
Norcini, J. J., & Shea, J. A. (1997). The credibility and comparability of standards. Applied Measurement in Education, 10(1), 39-59.
Norcini, J. J., Shea, J. A., & Kanya, D. T. (1988). The effect of various factors on standard setting. Journal of Educational Measurement, 25, 57-65.
Plake, B. S. (1998). Setting performance standards for professional licensure and certification. Applied Measurement in Education, 11(1), 65-80.
Reid, J. B. (1991). Training judges to generate standard-setting data. Educational Measurement: Issues and Practice, 10(2), 11-14.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shepard, L. A. (1995). Implications for standard setting of the National Academy of Education evaluation of the National Assessment of Educational Progress achievement levels. Proceedings of the Joint Conference on Standard Setting for Large-Scale Assessments (pp. 143-160). Washington, DC: The National Assessment Governing Board (NAGB) and the National Center for Education Statistics (NCES).
Simon, H. A. (1989). Models of thought (Vol. II). New Haven, CT: Yale University Press.
Smith, R. M. (1991). IPARM: Item and person analysis with the Rasch model. Chicago: MESA Press.
Stone, M. H. (1998). Rating scale categories: Dichotomy, double dichotomy, and the number two. Popular Measurement, 1(1), 61-65.
Swanson, D. B., Dillon, G. F., & Ross, L. E. (1990). Setting content-based standards for national board exams: Initial research for the Comprehensive Part I Examinations. Academic Medicine, 65, 17-18.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.
Zwick, R., Senturk, D., Wang, J., & Loomis, S. C. (2001). An investigation of alternative methods for item mapping in the National Assessment of Educational Progress. Educational Measurement: Issues and Practice, 20(2), 15-25.
Author

NING WANG is an assistant professor, Center for Education, Widener University, One University Place, Chester, PA 19013; [email protected]. Primary research interests include the development of performance-based assessments, the development and evaluation of standard-setting methods, and the investigation of job analysis methodologies for licensure and certification examinations.