Comparing Construct Definition in the Angoff and Objective Standard Setting Models: Playing in a House of Cards Without a Full Deck

Educational and Psychological Measurement 71(6) 942-962
© The Author(s) 2011
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164410394338
http://epm.sagepub.com

Gregory Ethan Stone1, Kristin L. K. Koskey2, and Toni A. Sondergeld1

Abstract

Typical validation studies on standard setting models, most notably the Angoff and modified Angoff models, have ignored construct development, a critical aspect associated with all conceptualizations of measurement processes. Stone compared the Angoff and objective standard setting (OSS) models and found that Angoff failed to define a legitimate and stable construct. The present study replicates and expands this work by presenting results from a 5-year investigation of both models, using two different approaches (equating and annual standard setting) within two testing settings (health care and education). The results support the original conclusion that although the OSS model demonstrates effective construct development, the Angoff approach appears random and lacking in clarity. Implications for creating meaningful and valid standards are discussed.

Keywords
standard setting, Angoff, objective standard setting, construct validity

1University of Toledo, Toledo, OH, USA
2University of Akron, Akron, OH, USA

Corresponding Author:
Gregory Ethan Stone, University of Toledo, 2801 W. Bancroft Street, MS #921, Toledo, OH 43606, USA
Email: [email protected]


Validity is a principal concern of all measurement. Nowhere is the concept more important than in high-stakes testing environments, where standard setting is used to inform critical pass/fail decisions made each day that affect the lives of thousands of students and professionals worldwide. Norcini and Shea (1997) provided a set of criteria for evaluating the credibility of standard setting methods including, but not limited to, using multiple content experts in the process, reviewing the exam item by item, and using an absolute standard as opposed to a relative or norm-referenced standard. In an extensive review of the validity of standard setting models, Kane (2001) further emphasized the importance of procedural validity (demonstrated by the systematic implementation of standard setting methodology), internal consistency (of judges), comparisons to external measures, and similar theoretically independent observations designed to evaluate the likely reliability of the processes and outcomes. Camilli, Cizek, and Lugg (2001) promoted the incongruous notion that "current standard-setting procedures can vary greatly, yet all of them have roughly equivalent internal systems of rules and procedures" (p. 471). They concluded that, based on this similarity, if the rules and procedures of any particular system are followed, it may be concluded that "essential validity evidence" has been provided. Speculating on the future of standard setting, these authors and others (Mehrens & Cizek, 2001) continue by asserting that such evidence provides a "hedge" against what Glass (1978) called the arbitrariness of standard setting. Unfortunately, neither offers such a hedge, and both overlook the most critical aspects associated with standard setting: validation of the process for developing a construct and the validity of the resultant construct.

This study is concerned with these critical aspects of construct development. First, the methodology of construct development (validation) is described as central to the standard setting process. Two standard setting models, and two variants of each, are detailed and evaluated in terms of their ability to create meaningful and valid standards as applied in the fields of education and health care. Next, using an equating and anchoring process, the product of each model (i.e., the construct) is assessed for validity. Finally, we argue that an increased awareness of the importance of a more comprehensive understanding of the validation process (methodological validation and construct validity) within standard setting is essential for future development in the science. The fundamental questions of conceptual meaning, construct acceptability, and intuitive approach have for too long been lost in narrowly focused evaluations of individual elements of existing models.

In 1978, Gene Glass referred to standard setting as "a meaningless application of numbers to a question not prepared for quantitative analysis" (p. 242). Twenty-three years later, Kane rightly noted that attempts to establish the validity of standards and overcome the objections of Glass had "proven frustrating." Although Kane presented the need to understand standard setting holistically, he nevertheless continued by parceling the process into bits and pieces and associating validity not with the sum but with the parts. Instead, we believe it is essential to revisit the foundational principles on which the theory and assumptions behind standard setting are based.

Validity and Validation

Construct development, construct validity, and the validation process are all centrally important concepts within measurement, and a substantive evaluation of each has traditionally been absent from the majority of research on standard setting.


This consistent oversight may have developed out of the fractured nature of modern concepts of validity. Messick (1980) presented a compelling argument for reconsidering notions of validity. He argued that by breaking down validity into a series of smaller, minor parts, we run the risk that, in the resulting confusion, the more important holistic conception of construct validity is generally lost. Messick proposed that

construct validation is a process of marshalling evidence to support the inference that an observed response consistency in test performance has a particular meaning. The process attempts to link the reliable response consistencies reflective of a presumably common underlying construct, usually an attribute or process or trait, that is itself embedded in a more comprehensive network. (p. 1015)

In other words, these "parts" support construct validity, but when considered alone they do not define construct validity. As Hood (2009) noted, however, Messick never fully described what may be considered sufficient evidence for the establishment of validity. Furthermore, others (Borsboom, 2005; Borsboom, Mellenbergh, & van Heerden, 2004; Sechrest, 2005) have since criticized Messick's framework as being too dependent on correlational methods, raising the question, for instance, "should we believe that a measure 'has' construct validity simply because it is correlated with some better-established measure?" (Sechrest, 2005, p. 1593). Messick's framework is also criticized for focusing too little on the primary and necessary condition for validity, that is, preexisting theoretical support for the existence of the construct being measured. Borsboom argues that conceptual validity is fundamentally ontological, focusing on the relative existence of the quality, and that what is commonly referred to as validity is instead the epistemologically oriented "validation" process. Within this framework, the essence or presence of the construct is itself validity. Sechrest (2005) has argued a similar point: it is the "correct specification of the construct in the first place" (p. 1586) that is essential for supporting validity.

Similarly, in standard setting, the standard (itself a meaningful construct) and the procedure used to establish the resultant outcome should not be overlooked. Altering the procedures used to set standards even ever so slightly can alter the resultant outcome. Sechrest's (2005) simple example documenting various procedures used to measure blood pressure to determine the most accurate measure demonstrates this latter point. He demonstrated how slightly altering the measurement procedure for what is conceived of as a well-established construct can affect the construct validity. Although Camilli et al. (2001) hold that all standard setting models are similar enough to result in a common outcome when a general set of rules is followed, this cannot be assumed. Regardless of the differences between Messick's and Borsboom's frameworks for validity, both share a common argument that a more holistic approach focusing on construct validity is necessary (Hood, 2009). In this study, we took a more holistic approach by considering the meaningfulness of the process used for standard setting and of the resultant construct.


Construct Validity and Standard Setting

Criterion-referenced standards, used in high-stakes testing, are designed to represent both a specific set of knowledge, skills, and abilities deemed critical for success and a level of test-taker performance within that set of abilities. The major product of a standard setting exercise should therefore be a construct representative of these requirements. To assess the validity of the product, it is insufficient to review the internal judgment-making processes alone. Instead, as Messick implied, it is important to embed the decision in a network or framework for what Borsboom might call the validation process. How the construct itself may be assessed for validity is of critical concern.

Camilli et al. (2001) suggested that all standard setting models essentially follow the same rule. Specifically, whatever unique features each model may have, all revolve around the prediction of the success of minimally competent test takers on a series of items. Although certainly true of traditional models dating back to the 1950s, Camilli et al.'s assertion essentially ignores modern developments in standard setting that have taken place over the past 30 years, and in particular work that has emerged more recently. In 1970, Benjamin Wright and Martin Gross began experimenting with what were, at the time, newly conceptualized Rasch-based (Rasch, 1960, 1980) models for considering the development of criterion standards at the National Board of Medical Examiners. Based in part on those theoretical beginnings, two new models and a host of tangentially related models emerged. The objective standard setting (OSS) model (Stone, 1995a, 1995b, 1996a, 1996b, 2001, 2003, 2004) and the bookmark model (Lewis, Mitzel, & Green, 1996) both represent clear departures from the traditional models of Angoff (1971), Ebel (1979), and Nedelsky (1954).

In Setting Performance Standards (Cizek, 2001b), the bookmark model was briefly included as the single representative of the modern standard setting models. Cizek (2001a) states that this model is substantively no different from the remainder of the models he chose to include. This notion is incorrect. Traditional models ask expert judges to predict the performance of hypothetical test takers and then use those predictions to define the standard. Such predictions are only indirectly related to the content, via item difficulty, and even then, those item difficulties are speculative. The bookmark and OSS models originate from the theoretical assertion that content-based constructs must be rooted in construct-based decisions. All decisions made by expert judges participating in standard setting exercises are direct evaluations of content and content presentation relevance. There are unambiguous differences between both Rasch-based models and the traditional models in both process and theory, and it was these differences that served as the impetus for the development of these new approaches. Such a fundamental lack of understanding demonstrates the inherent weakness of Cizek's attempts at both validating the process and producing evidence of construct validity.

Today, the Angoff model for setting standards is arguably the most widely used model and well represents those of its genre. This model is simple for users to calculate but provides little information to challenge the development of a construct, therefore making the illusion of validity far easier to establish. In contrast, the OSS


model, having developed from the early Rasch work of the 1980s, embraced defining the ability construct as the principal goal of the model. Once defined, its application to test takers was intuitive, considering the logistic ruler onto which both test takers and items may be placed.

The presented idea of construct development vis-à-vis standards is relatively new. Conceptually, the framework arises from the definition of the criterion itself and the meaningfulness of criterion referencing as a process. One of the more fundamental reasons for adopting criterion models is their ability to meaningfully describe a successful performance, not in terms of relative positioning against contemporary test takers but through an understanding of knowledge, skills, and abilities. It is therefore reasonable to expect that the standard must describe that performance and not simply be represented by a cut score. If it is reasonable for the total examination to be considered representative of a construct, it is then reasonable to consider the criterion standard as a construct as well; one that is representative of part of the whole, but one that represents more than a cut score. When considered in terms of the Rasch model, use of the term construct may be more out of convenience than necessity, because any point along the larger (construct) continuum may be said to represent a content-defined part of an entire construct; however, when comparing the process outcomes to traditional models, the term is required.

Stone (1996b) presented results from a study comparing construct development within the Angoff and OSS models. The study demonstrated both the ineffectual nature of the Angoff process in defining clear, stable, and meaningful constructs and, conversely, the success of OSS in achieving this goal. This evaluation required that the standards, processes, and outcomes be considered within what Messick had earlier called a more comprehensive network of validity and within Borsboom's framework of validation. Stone described three elements that interact when standards are applied. Element 1, test takers, may be represented by their level of ability. From one administration to another, groups may demonstrate lesser or greater ability, and therefore group pass and fail rates naturally fluctuate. Element 2, examination items (or the examination), may be represented by the level of difficulty of the items presented. As with ability, groups of items may be of lesser or greater difficulty and tend to fluctuate each time a new test is constructed. Although testing authorities may attempt to carefully control the degree of variance, there will nevertheless always be some fluctuation. Finally, Element 3 represents the standard itself. Criterion-referenced standards are, as indicated, designed to represent a set of knowledge, skills, and abilities deemed critical for success. This set, and the corresponding performance level associated with it, define the construct. Unless fundamental ability requirements are changed from year to year, such constructs should remain consistent, and therefore no fluctuation should be expected outside of normal variance associated with measurement error. Unfortunately, as Norcini and Shea (1997) argued, one of the most obvious statistics for gauging the success of this process and its outcome, the pass rate, is unreliable because of the state of flux in which it exists.

Using observed pass rates as validity evidence represents a fundamental error made by standard setting researchers, including Kane and Cizek. To assess whether or not Element 3 (the standard) is


stable, and therefore representative of a clear, well-defined construct, Elements 1 and 2 must be controlled. Fortunately, Stone (1995a, 1995b) demonstrated that controlling for these two elements is possible. Specifically, Stone used Rasch common-item equating (anchoring) to control for item difficulty and traditional linear equating to control for test-taker ability, thereby accounting for the differences across administrations. With these artificially forced controls in place, the consistency of Element 3 (the standard) may be observed. If the standard is stable (supported by a well-defined construct), then the new passing rates should be consistent, when accounting for measurement error. In contrast, if the new passing rates fluctuate after controlling for both Elements 1 and 2 (item difficulty and test-taker ability), it suggests that Element 3 was neither clear nor well defined. In short, no construct would be defined in the latter case. Stone's study assessed the holistic validation process and concluded that only the OSS model provided substantive evidence that a clear, consistent, and valid construct had been defined.
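To make the first of these controls concrete, the sketch below illustrates mean/mean common-item anchoring, one simple way of carrying out the Rasch common-item equating described above. The item labels, difficulty values, and use of NumPy are illustrative assumptions and are not the calibration software or data used in the studies discussed here.

```python
import numpy as np

# Hypothetical Rasch item difficulties (in logits) for items appearing on both forms.
base_year = {"itm01": -0.42, "itm07": 0.15, "itm12": 0.88, "itm19": -1.10}
new_year = {"itm01": -0.20, "itm07": 0.33, "itm12": 1.05, "itm19": -0.95}

common = sorted(set(base_year) & set(new_year))

# Mean/mean anchoring: the equating constant is the average difficulty difference
# on the common items; adding it to every new-year calibration (item difficulties
# and person measures alike) places the new form on the base-year scale.
shift = np.mean([base_year[i] - new_year[i] for i in common])
new_year_anchored = {item: round(d + shift, 2) for item, d in new_year.items()}

print(round(shift, 2), new_year_anchored)
```

Once every administration sits on this common scale, a fixed criterion point can be applied directly to each year's person measures, and the resulting pass rates can be compared.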

The Current Study

In this study, we replicated the methods used by Stone (1995a) and expanded the scope across 5 years and two different types of testing situations, within health care licensure and an educational setting. Specifically, the purpose of this study was to compare the Angoff and OSS models (both equated and annually established) to determine which yielded meaningful standards and defined the most stable construct. A stable construct definition is conceptualized here as a consistent passing rate over the course of a 5-year period, when differences in person ability and item difficulty are controlled. A description of the four standard setting conditions follows.

Method

Sample

Five high-stakes testing organizations participated in the study. Three of these organizations were responsible for medical certification examinations to determine board eligibility. Board eligibility is critical for physicians in practice to ensure their ability to acquire hospital privileges and insurance reimbursement. Each organization was responsible for a unique examination covering a specific medical discipline. The remaining two organizations were responsible for educational achievement examinations used to assess performance of eighth-grade students in the subjects of English and mathematics. The two educational examinations were used to assess achievement and largely governed promotion to high school. Each program was therefore also considered to be high-stakes. Standard setting panels were convened for each of the five organizations. Each panel consisted of 8 to 14 panel members considered to be experts in their respective fields.


Analysis Models and Procedures

Two models for the definition of standards across the 5 years were employed. In addition, two variants of each model were used. The two specific variants were selected to correspond with what might be considered "best practices" and the originally defined expectations of each model, to ensure that each was used in its optimal setting.

Although the Angoff model (1971) was designed to be used within an equating framework, the nature of the process is based in true-score methodology. The raw-score predictions of success are not reasonably comparable from one year to the next (Wright & Stone, 1979). Stone (1995b) argued that equating from year to year should be considered inherent within the notion of a standard being criterion referenced. Nevertheless, the acceptability of equating within the original Angoff model is highly questionable, and therefore, in this study, Angoff standards were set in two ways. First, Angoff standards were set in 2004 (the first year of this study) and equated through 2008 (the last year of this study). Second, Angoff standards were reset (set again) in each year subsequent to 2004 (2005-2008). Doing so allowed for an exploration of Angoff standards both within a theoretically justifiable model and within the model explicitly described by Angoff.

Standards defined by the OSS model are designed to be useful across time. Through the use of Rasch common-item equating, OSS standards have demonstrated effective use across years, as long as no change in the construct being assessed occurs. However, if standards based on reasonably stable constructs are developed, it should be possible for the standards to be reset each year such that they are largely equivalent. Therefore, OSS standards were both equated across years and reestablished each year. This approach allowed for both a direct exploration of construct development in the OSS and a comparison with the optimal or classic Angoff model specifications. We describe the traditional Angoff model and the OSS model next, along with the specific procedures followed in this study.

Traditional Angoff model. The Angoff procedure requires expert judges to make two determinations. The first is accomplished as a group process. Judges (field experts) are asked either to define in precise detail their notion of "minimal competence" or to review a preexisting definition of minimal competence and strictly adopt it as a group. This process permits judges to create their own definition rather than imposing a definition on them. Evidence exists in the fields of psychometrics, speech and language pathology, and the social sciences that individuals are able to use self-defined scales to make reliable judgments (Beltyukova, Stone, & Fox, 2008; Eadie & Doyle, 2002; McColl & Fucci, 1999, 2006; Zraick & Liss, 2000). Self-defined scales have also been found to yield precise outcomes (Beltyukova, Stone, & Ellis, 2008; Beltyukova, Stone, & Fox, 2008; Lodge, 1981; Southwood & Flege, 1999), sometimes more precise than when a predefined scale was imposed on an individual (Beltyukova, Stone, & Fox, 2008; Southwood & Flege, 1999).

The definitional process begins by inspecting the examination's current outline and the specific examination for which the standard is being set. Through a process of round-table discussions, the panel defines a body of knowledge considered to be of


central importance (in these instances, the requirements of medical practice and educational advancement). Included in the requirements are all types of information and content areas considered essential to the reasonable performance of tasks associated with a successful outcome (passing score). Once the core knowledge has been identified, the judges as a group must define and describe a hypothetical minimally competent test taker. Judges are asked to answer the following question: "Of the body of defined knowledge, which pieces of information are vital to safe practice [in the case of medical examinations] or advancement to the next grade [in the case of educational examinations]?" The information defined as vital (that which successful test takers must know) is the basis for the hypothetical, minimally competent test taker.

The second judge-directed determination is made at the level of the individual judge. Using the examination for which the standard is to be set, each judge approaches each item and answers the following question: "What percent of minimally competent test takers will answer the item correctly?" Although the decisions are made independently, it is of great importance that the predictions across judges be identical. There is considerable evidence that an iterative process (i.e., providing feedback to the judges, including but not limited to the actual difficulties of the items being assessed) is the only way to attain the goal of consistency. Unfortunately, if the intent of the Angoff process is to define a criterion standard rather than simply a cut point, iteration produces confusion by introducing additional normative factors. In the instances where actual difficulties are used, speculative judge predictions are ultimately normed to actual test-taker performance, rendering those predictions irrelevant. The value of the original predictive exercise is therefore highly questionable. When the judges instead participate in consensus-building exercises, judgments are normed across judges for as long as it takes to reach a desired level of variability.

Among the important guidelines established for standard setting in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council of Measurement in Education, 1999) are diversity and individuality. It is important that panel members represent the full range of whatever may relate to the program (region, specialty, etc.). In addition, it is important that panelists complete the initial rating process individually to take full advantage of their expertise without "peer pressure." To then participate in consensus-building exercises, one, two, or three times, as this manual also suggests, until variability is reduced to the desired level, again defeats the purpose of those individual ratings. Angoff proponents, including Cizek and Kane, contend that through iterative manipulations judge variances may be minimized. Norcini, Shea, and Kanya (1988) advanced this conclusion by suggesting that judges tend to modify about one quarter of their predictions based on the actual performance values. Iterative processes will, without doubt, reduce variance, since all judges are largely converging to the same point, the actual performance value. Brennan and Lockwood (1980) suggested that the utility and validity of the Angoff (and other nonobjective approaches) depend heavily on the extent to which the judges are in agreement.
It may be reasonably extrapolated that if judges do not agree with one another particularly well, an error has


occurred during the definitional process. In this instance, the error would relate to the definition of the "minimally competent" individual. Angoff scholars argue that if the judges agree on the number of minimally competent test takers expected to correctly answer an item, it may be accepted that each judge has defined an identical notion of competency. The converse should then also hold true: when judges disagree, each may be said to have defined a different notion of competency. The most important question related to the iterative process is not, therefore, whether it works to reduce the variance of the standard. This claim cannot be disputed. Rather, the crux of the concern is whether iteration is rational or whether it, as we argue, is theoretically unsound within the framework of the traditional models and leads only to a false sense of security in a now bewildered standard.

In the Angoff model, judges are asked to define the performance of a test taker who is purely hypothetical. The task may be untenable for judges to adequately complete. Robert Ebel (1979), in discussing his own nonobjective standard setting model, indicated that the determination of a minimum acceptable test-taker performance always involved some arbitrary and unsatisfactory decisions. The same criticism holds for the Angoff approach. The conventional percentage (or proportional) approach employed by Angoff theoreticians suffers from at least two related weaknesses. First, the percentage (or proportion) is an arbitrary construction without a clear justification. Differing judges with unique definitions of a mythical minimally competent test taker make the situation impractical. Despite extensive training, the inability of judges to consistently agree has been aptly demonstrated (Plake, Melican, & Mills, 1991; Saunders & Mappus, 1984). Second, substantial elements in the determination of the cut score are left to chance. The items may be more or less difficult, and more or less discriminating. Whether a test taker passes or fails may depend on the specific questions he or she encounters rather than his or her level of expertise, threatening the validity of the score meaning. In essence, the criterion of the traditional criterion-referenced standard is "number of items." No content is specified in the criterion at all, merely a number (or proportion) of items. Referring to these speculations regarding minimal performance, Jaeger (1982) suggested that the prediction of performance was an unworkable task for judges, and he thus eliminated judge speculation concerning the "percentage of minimally competent examinees" from his system.

Because of the conceptual and practical errors associated with iteration, such a practice was not employed in this study. The outcome of the Angoff model was therefore the sum of predictions for each judge, and the grand mean across judges became the final Angoff standard.
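As a concrete illustration of that final calculation, the sketch below computes an Angoff standard from a small judge-by-item matrix of predictions. The number of judges and items and the probability values are hypothetical and are not taken from the panels in this study.

```python
import numpy as np

# Hypothetical predictions: each cell is a judge's estimate of the proportion of
# minimally competent test takers expected to answer that item correctly.
predictions = np.array([
    [0.60, 0.75, 0.55, 0.80, 0.70],  # judge 1
    [0.65, 0.70, 0.50, 0.85, 0.75],  # judge 2
    [0.55, 0.80, 0.60, 0.75, 0.65],  # judge 3
])

# Each judge's raw cut score is the sum of his or her predictions across items;
# the grand mean of those sums is the final Angoff standard (in raw-score points).
judge_cuts = predictions.sum(axis=1)
angoff_standard = judge_cuts.mean()

print(judge_cuts, round(angoff_standard, 2))  # [3.4  3.45 3.35] 3.4
```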


Objective standard setting model. The OSS model is defined as a three-part exercise: defining the criterion set, refining the criterion point, and expressing the error. Defining the criterion set begins when expert judges are provided with a copy of an examination that was previously calibrated using the Rasch model. Each expert judge independently reviews the items presented on the examination and determines which of the items, as written, represent what is to be considered a core set of essential items. The notion of the term essential is of critical importance to the exercise. During training, judges spend considerable time discussing and defining this term. Furthermore, the term may be replaced by other common terms depending on the group. For instance, within the educational panels, expert judges considered the terms critical and core content to be more comfortable and meaningful for their use. However, for the health care groups, critical had very specific and unacceptable connotations. Therefore, the terms essential, vital, and most important were used with the medical panels. Once appropriate terms were defined for each group, each judge selected an individual set of essential items that became his or her criterion set. To quantify the criterion set, the mean of the criterion item difficulties was calculated for each judge. The grand mean across all judges was used as the final quantification of the criterion set.

The second step in the OSS model serves to refine the quantified criterion point. Because the unrefined criterion point obtained in Step 1 is based on a simple average, and therefore suffers to one extent or another from being calculated from a nonnormal distribution, it is advisable to seek ways in which the precision of the criterion might be improved. Over time, the methods for refinement have changed. The present and most successful option asks expert judges to review a smaller subset of items and is loosely based on the Wright mapping approach discussed by Schulz and Mitzel (2009). The typical Wright mapping approach arranges all items on an examination in difficulty order along the defined variable continuum and asks expert judges to define the point along the variable at which essential tasks become nonessential. Although the approach is theoretically sound, Stone (1995b) found the specification of a defined point on a construct continuum difficult, in part because the information presented on a broad examination is often disconnected. Furthermore, while a passing point for one content area may be placed at one point along the continuum, the passing point for a second content area should be placed elsewhere. However, by reducing the number of items expert judges must consider, these difficulties may be reduced.

Maps used for the refinement exercises were generated using the standard error of measurement (SEM) associated with each examination. The center of each map was set at the established criterion point. The quantified upper and lower ends of the map were set at 1.96 × SEM (or the 95% confidence level). Ten items were selected above and below the criterion point and were evenly spaced between the criterion point and the upper and lower boundaries. An effort was made to maintain a reasonable content balance of items, although with a total of 20 items, this was not altogether possible. Expert judges were asked to individually determine at what point along the 20-item ruler the content of the items shifted from essential to nonessential and to place a dividing mark accordingly. The point between the newly selected items became the refined criterion point for each judge, and the grand mean across all judges was used as the final, refined criterion point. Refinement is exceptionally useful because it adds information regarding the performance level of candidates reflected directly on the defined construct.
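The sketch below walks through the first two OSS steps with made-up numbers: each judge's criterion point is the mean Rasch difficulty of his or her essential items, and the refinement map spans the criterion point plus or minus 1.96 × SEM, with ten item positions on each side. The item difficulties, essential sets, and SEM value are assumptions for illustration only.

```python
import numpy as np

# Hypothetical Rasch item difficulties (logits) for a 40-item examination.
item_difficulty = {f"itm{i:02d}": round(d, 2)
                   for i, d in enumerate(np.linspace(-2.0, 2.0, 40), start=1)}

# Hypothetical "essential" selections made independently by three judges.
essential_sets = {
    "judge1": ["itm12", "itm15", "itm18", "itm21", "itm24"],
    "judge2": ["itm10", "itm14", "itm19", "itm22", "itm26"],
    "judge3": ["itm11", "itm16", "itm20", "itm23", "itm25"],
}

# Step 1: each judge's criterion point is the mean difficulty of that judge's
# essential items; the grand mean across judges is the unrefined criterion point.
judge_points = {j: np.mean([item_difficulty[i] for i in items])
                for j, items in essential_sets.items()}
criterion = float(np.mean(list(judge_points.values())))

# Step 2: bound the 20-position refinement "ruler" at criterion +/- 1.96 * SEM, with
# ten evenly spaced positions below and ten above the criterion point. In practice,
# the calibrated items nearest these positions would be presented to the judges.
SEM = 0.25  # assumed standard error of measurement for the examination, in logits
lower, upper = criterion - 1.96 * SEM, criterion + 1.96 * SEM
positions_below = np.linspace(lower, criterion, 11)[:-1]
positions_above = np.linspace(criterion, upper, 11)[1:]

print(round(criterion, 2), positions_below.round(2), positions_above.round(2))
```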
The final step in the OSS model accounts for the error associated with the measurement process. All measurement processes generate error, and the OSS model accounts for it within its model statement in one of two fundamental ways. Examination organizations may choose either to protect the public by not passing any test taker in


whom they are not very confident earned a clearly passing score, or they may choose to protect the innocent test taker by failing only those test takers whom they are exceptionally clear should fail. In the former case, standard practice suggests adding to the criterion point an amount linked to the SEM associated with the precision of the examination. In the latter case, an amount linked to the SEM of the examination would be subtracted from the criterion point. The amount added or subtracted would vary, depending on the level of precision the examination organization chose to adopt and the determined approach (protection of the public or of the innocent test taker). In the present study, all panels elected a centrist position, and as a result, no alterations to the refined criterion point were made.

Model comparisons. Because the outcome of the traditional Angoff model is a proportion or percentage value, for the purposes of comparison, the original Angoff standards were converted onto a logit scale equated to that derived through the OSS model. As indicated earlier, standards were set in two ways. First, standards were established in Year 1 of the study (2004) and equated in each of the subsequent 4 years. Second, standards were also set on an annual basis using both models. When set annually, the controls for differences in test-taker ability and examination difficulty are not applicable, and annually established pass rates therefore cannot be reliably compared from one year to the next directly. Rather, if reasonable constructs have been established, these annually established standards (definitions of ability) should be within error proximity of the equated definitions of ability.

For the first set of comparisons, it is necessary to put in place controls for the fluctuations of item or test difficulty and test-taker ability. The former is easily accomplished using Rasch common-item equating (Wright & Stone, 1979), which places each of the five annual examination administrations onto the same scale. Although individually each examination may be more or less difficult, their reference to the measurement of ability is identical. The latter, controlling for fluctuations in test-taker ability, can be most adequately achieved by traditional linear equating, such that the mean and standard deviation of test-taker scores for Years 2 through 5 are equated to those of Year 1. These controls allow for the intuitive review of the OSS standards, and, because the Angoff standards were converted to a logit scale, they too can be evaluated. Once the controls were applied, passing rates for each standard, across each of the 5 evaluated years, were calculated. For the second set of comparisons, each newly established standard was recalculated and plotted against its respective equated standard.
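A minimal sketch of the second control and the resulting comparison appears below: person measures from a later year are linearly rescaled to the Year 1 mean and standard deviation, and the pass rate is then recomputed against a fixed criterion point on the logit scale. The cohort parameters, sample sizes, and cut value are invented for illustration and do not correspond to any of the five programs studied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical person ability measures (logits), already anchored to a common item scale.
year1 = rng.normal(0.40, 1.00, 500)
year2 = rng.normal(0.65, 0.90, 500)  # a somewhat more able later cohort

# Traditional linear equating of ability: rescale Year 2 so that its mean and
# standard deviation match Year 1, removing cohort fluctuation from the comparison.
year2_equated = (year2 - year2.mean()) / year2.std() * year1.std() + year1.mean()

# With item difficulty anchored and ability equated, a stable standard should
# produce a stable pass rate; large fluctuation would suggest no construct was defined.
criterion = 0.10  # fixed criterion point on the logit scale (illustrative value)

def pass_rate(theta, cut):
    return 100.0 * np.mean(theta > cut)

print(round(pass_rate(year1, criterion), 1), round(pass_rate(year2_equated, criterion), 1))
```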

Results

Tables 1 through 5 present calculated pass rates by judge, across each of the 5 administration years, for both the Angoff and OSS models using the equated exercise data. Total passing rates across all judges by examination year are presented at the bottom of each table. In reviewing the rates of passage, it was noted that the standard deviations of the passing rates associated with the OSS are significantly lower than those associated with Angoff. This pattern was observed across judges and across years. Although the standard deviation of the passing rates across judges generally appeared as approximately 1% for the OSS, deviations for the Angoff model were typically at 4% or higher. There appeared to be little difference in standard deviation by the type of examination (medical vs. educational) or by the number of members on the standard setting panel (larger or smaller).

Table 1. Medical Examination Panel 1 Yearly Pass Rates (by Judge)

         ----------- Angoff -----------    ---------- Objective ----------
Judge    2004  2005  2006  2007  2008      2004  2005  2006  2007  2008
1          58    75    74    83    74        71    68    70    67    67
2          65    77    65    84    70        71    69    73    71    67
3          68    76    63    81    76        68    69    72    71    69
4          54    79    61    75    74        67    71    71    72    69
5          70    80    59    78    68        70    70    71    70    70
6          65    72    58    74    67        71    70    70    71    71
7          68    74    61    82    64        70    72    70    71    70
8          59    78    62    77    75        67    71    71    70    71
M          63    76    63    79    71        69    70    71    70    69
SD          6     3     5     4     4         2     1     1     2     2

Table 2. Medical Examination Panel 2 Yearly Pass Rates (by Judge)

         ----------- Angoff -----------    ---------- Objective ----------
Judge    2004  2005  2006  2007  2008      2004  2005  2006  2007  2008
1          86    79    68    82    60        79    76    76    76    76
2          89    82    67    81    64        77    77    75    76    77
3          92    80    69    78    72        76    75    78    77    77
4          95    74    74    75    73        76    76    77    77    75
5          91    79    73    79    80        78    76    76    77    75
6          86    69    76    86    71        77    77    76    79    75
7          85    73    72    75    68        77    78    77    76    76
8          84    74    67    71    60        76    76    75    76    78
M          89    76    71    78    69        77    76    76    77    76
SD          4     4     3     5     7         1     1     1     1     1

Table 3. Medical Examination Panel 3 Yearly Pass Rates (by Judge)

         ----------- Angoff -----------    ---------- Objective ----------
Judge    2004  2005  2006  2007  2008      2004  2005  2006  2007  2008
1          73    70    79    84    74        70    68    69    69    65
2          71    65    77    85    73        66    69    68    67    64
3          66    68    75    82    78        67    65    66    72    64
4          67    73    68    80    80        68    66    68    65    67
5          69    74    68    79    72        68    64    80    66    66
6          70    72    70    83    76        69    68    69    64    67
7          72    66    67    83    77        67    69    68    67    67
8          72    66    75    82    75        68    66    66    69    64
9          65    62    74    86    74        64    68    67    67    66
M          69    68    73    83    75        67    67    68    67    66
SD          3     4     4     2     3         2     2     1     2     1

Table 4. Education Examination Panel 1 Yearly Pass Rates (by Judge)

         ----------- Angoff -----------    ---------- Objective ----------
Judge    2004  2005  2006  2007  2008      2004  2005  2006  2007  2008
1          98    97    89    94    99        92    91    93    92    91
2          95    99    81    91   100        93    90    92    92    94
3         100    98    84    93    97        92    93    92    90    94
4          87    88    89    85    88        92    92    94    91    92
5          82    82    81    98    92        92    92    93    91    92
6          89    97    90    95    94        93    89    93    92    93
7          98    99    92    92    96        93    90    92    92    93
8          93    90    91    83    93        93    91    94    91    94
9          90    89    84    81    94        92    92    93    92    92
10         80    99    85    99    90        93    92    93    90    91
11         86    89    88    84    92        90    93    92    93    92
12         89    86    77    89    89        91    93    92    91    93
13         98    97    79    90    99        91    93    94    92    92
14         96    98    86    93   100        93    92    93    91    92
M          92    93    85    91    95        92    92    93    91    93
SD          6     6     5     6     4         1     1     1     1     1

Table 5. Education Examination Panel 2 Yearly Pass Rates (by Judge)

         ----------- Angoff -----------    ---------- Objective ----------
Judge    2004  2005  2006  2007  2008      2004  2005  2006  2007  2008
1          97    95    90    84    95        89    89    92    91    91
2          89    97    92    93    94        90    91    91    91    91
3          98    90    84    96    88        90    90    90    92    90
4          93    82    86    83    89        92    91    91    94    89
5          97    89    84    94    85        91    91    91    93    90
6          99    94    75    92    98        92    92    92    93    92
7         100    92    89    96    99        92    91    91    93    91
8          93    94    95    93    92        92    92    92    92    91
9          93    85    93    78    95        91    90    91    92    92
10         91    84    87    82    96        91    90    92    91    91
11         86    92    78    86    97        92    89    91    92    91
12         85    95    82    80    87        91    89    92    92    92
13         83    87    81    91    98        91    89    92    92    91
14         93    83    80    90    94        91    90    91    92    92
M          93    89    85    88    93        91    90    91    92    91
SD          5     5     6     6     4         1     1     1     1     1
Figures 1 through 5 compare passing rates across administration years for both the OSS and Angoff models and for both strategies, equated and annually established. Equating lines are placed on the figures for the standards established by the OSS and Angoff models; these lines are placed at the standard as originally established in 2004. For a standard to remain meaningfully stable, and thus to demonstrate the definition of a construct, passing rates for each subsequent examination should fall within the 95% confidence band of that line. An inspection of the figures proves exceptionally revealing. The controlled passing rates for the OSS-equated model remain well within the 95% confidence band. This finding suggests that, in all likelihood, the standard and the construct being measured are consistently well defined. Conversely, the controlled passing rates for the Angoff-equated model are exceptionally variable. When both test-taker ability and item difficulty are controlled and passing rates remain in flux, it can be concluded that no construct has been defined.

Annual passing rates demonstrate a similar condition. When reviewing annual pass rates for Angoff, it is helpful to use both the Angoff equating line shown in Figures 1 through 5 and the equated Angoff standard itself. Although in theory the annually set standards should pass test takers at rates very close to the Angoff equating line, Angoff scholars may argue that this practice unfairly holds the model to a standard not otherwise anticipated (i.e., equating). It may therefore be more logical to determine whether there is a pattern within the annually set standards that mimics the pattern found with the equated standards. In most instances, the passing rates associated with the annually established Angoff standards matched neither the actual passing rates of the equated standards nor the theoretically correct passing rates of those same standards. In essence, the Angoff model did not present a discernible pattern in any form, indicating that this model failed to define a construct. Figures 1 through 5 demonstrate a very different condition with regard to the OSS. Whether equated as intended or annually established, the OSS offers clear and unmistakably meaningful standards and, as a result, demonstrates effective construct definition.

Figure 1. Passing rate comparison for Medical Examination 1
Note: Passing rates are compared across administration years for both objective standard setting and Angoff, and for both strategies, equated and annual. Circles represent the objective standard setting model passing rates, whereas squares represent passing rates associated with the Angoff model. Darkened symbols indicate the passing rates for the equated standards, whereas clear symbols indicate the passing rates for the annual standards.

Figure 2. Passing rate comparison for Medical Examination 2
Note: Symbols and shading as in Figure 1.

Figure 3. Passing rate comparison for Medical Examination 3
Note: Symbols and shading as in Figure 1.

Figure 4. Passing rate comparison for Educational Examination 1
Note: Symbols and shading as in Figure 1.

Figure 5. Passing rate comparison for Educational Examination 2
Note: Symbols and shading as in Figure 1.

Discussion

Tenopyr (1977) argued that "any inference relative to prediction and . . . all inferences relative to test scores, are based upon underlying constructs" (p. 48). Criterion-referenced standard setting allows us to define, through a meaningful exploration of content and a rational quantitative process, the nature and scope of minimal required performances (or competencies within defined parameters). When applied successfully, standards allow us to make decisions regarding test-taker qualifications. Our decisions must therefore be based on reliable foundations originating from well-developed constructs. There is little doubt that all standard-setting theorists would agree on this point. What is more contentious, however, is the question of how to demonstrate the validity of the process and its outcome. Traditional evaluation focuses exclusively on the process, the judges, and the quantified outcome produced (namely, the proportional standard) but fails to capture the inherent and underlying conditions of what might be most important for establishing validity. Both Messick (1980) and Borsboom and his colleagues' (Borsboom, 2005; Borsboom et al., 2004) arguments are well placed. Most
studies have evaluated parts of the validity process and have lost sight of the more critical holistic understanding as Messick has argued. Furthermore, as Borsboom and his colleagues have outlined, there must be an existing construct to support the validity of any outcome produced. In this case, at the heart of any criterion standard there must be a construct and it is here that studies of validity should begin. The present study compared data from the Angoff and OSS exercises conducted over a 5-year period. Results were unambiguous. Whereas the OSS model defined a consistent and meaningful standard, the Angoff model did not, regardless of the analytical model employed (equated or annually established). Why then have there been so many studies in support of the validity of the Angoff model such that it remains the most often used model worldwide? Again, the reasons reflect Messick’s essential


concern. To argue that a particular judge behavior is valid is easy. To select a handful of particular pieces of the model and assess each of them individually is equally simplistic. Adding such simplistic evidence together does not make for a more compelling argument, but simply a longer one. Instead, for these numerous validity studies to produce substantive results, they must focus on the most fundamental of all questions: has the process defined a stable construct? The present study provides one answer to that question, one that affirms the outcome of and also generalizes beyond Stone's (1996a) original study, which focused exclusively on small medical testing boards.

If we are truly interested in creating meaningful and valid standards, then we must explore meaningful and valid questions. The stakes associated with passing and failing test takers grow greater each day. Students are held behind or not allowed to graduate if they do not reach the passing point, or are allowed to graduate without proper skills based on standards. Professionals are not allowed to practice if they cannot meet minimal performance criteria, or, if the minimum criteria are set so low, the tests become largely irrelevant. We cannot afford to cling to outdated standard setting models that are not built on solid foundational principles of measurement but rather on missing information held together only by an intricate network of piecemeal validity studies. In doing so, we are playing in a house of cards without a full deck, and the card we are missing is the trump, the construct. Because so much depends on the accuracy of our examination standards, by failing to move beyond the dysfunctional traditions of current practice, it is unfortunately the society in which we live that will continue to be dealt the "joker."

Declaration of Conflicting Interests

The authors declared no conflicts of interest with respect to the authorship and/or publication of this article.

Funding

The authors received no financial support for the research and/or authorship of this article.

References

American Educational Research Association, American Psychological Association, & National Council of Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Angoff, W. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 506-600). Washington, DC: American Council on Education.
Beltyukova, S. A., Stone, G., & Ellis, L. W. (2008). Rasch analysis of word identification and magnitude estimation responses in measuring naïve listeners' judgments of speech intelligibility of children with severe-to-profound hearing impairments. Journal of Speech, Language, and Hearing Research, 51, 1124-1137.
Beltyukova, S. A., Stone, G., & Fox, C. M. (2008). Magnitude estimation and categorical rating scaling in social sciences: A theoretical and psychometric controversy. Journal of Applied Measurement, 9, 151-159.


Borsboom, D. (2005). The concepts of validity. In D. Borsboom (Ed.), Measuring the mind: Conceptual issues in contemporary psychometrics (pp. 149-172). Cambridge, England: Cambridge University Press.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061-1071.
Brennan, R. O., & Lockwood, R. E. (1980). A comparison of Nedelsky and Angoff cutting score procedures using generalizability theory. Applied Measurement in Education, 4, 214-240.
Camilli, G., Cizek, G. J., & Lugg, C. (2001). Psychometric theory and the validation of performance standards: History and future perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 445-576). Mahwah, NJ: Erlbaum.
Cizek, G. J. (2001a). Conjectures on the rise and call of standard setting: An introduction to concept and practice. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 3-17). Mahwah, NJ: Erlbaum.
Cizek, G. J. (2001b). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
Eadie, T. L., & Doyle, P. C. (2002). Direct magnitude estimation and interval scaling of pleasantness and severity in dysphonic and normal speakers. Journal of the Acoustical Society of America, 112, 3014-3021.
Ebel, R. L. (1979). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice Hall.
Glass, G. V. (1978). Standards and criteria. Journal of Educational Measurement, 15, 237-261.
Hood, B. S. (2009). Validity in psychological testing and scientific realism. Theory & Psychology, 19, 451-473.
Jaeger, R. M. (1982). An iterative structured judgement process for establishing standards on competency tests: Theory and application. Educational Evaluation and Policy Analysis, 4, 461-475.
Kane, M. T. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53-88). Mahwah, NJ: Erlbaum.
Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996). Standard setting: A bookmark approach. Paper presented at the CCSSO National Conference on Large Scale Assessment, Phoenix, AZ.
Lodge, M. (1981). Magnitude scaling: Quantitative measurement of opinions. Newbury Park, CA: Sage.
McColl, D., & Fucci, D. (1999). Comparisons of magnitude estimation and interval scaling of loudness. Perceptual and Motor Skills, 88, 25-30.
McColl, D., & Fucci, D. (2006). Measurement of speech disfluency through magnitude estimation and interval scaling. Perceptual and Motor Skills, 102, 454-460.
Mehrens, W., & Cizek, G. J. (2001). Standard setting and the public good: Benefits accrued and anticipated. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 477-485). Mahwah, NJ: Erlbaum.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19.


Norcini, J. J., & Shea, J. A. (1997). The credibility and comparability of standards. Applied Measurement in Education, 10, 39-59.
Norcini, J. J., Shea, J. A., & Kanya, D. T. (1988). The effect of various factors on standard setting. Journal of Educational Measurement, 25, 57-65.
Plake, B. S., Melican, G. J., & Mills, C. N. (1991). Factors influencing intrajudge consistency during standard setting. Educational Measurement: Issues and Practice, 10, 15-16.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (expanded ed.). Chicago, IL: University of Chicago Press.
Saunders, J. C., & Mappus, L. L. (1984). Accuracy and consistency of expert judges in setting passing scores on criterion-referenced tests: The South Carolina experience. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Schulz, E. M., & Mitzel, H. C. (2009). A mapmark method of standard setting as implemented for the National Assessment Governing Board. In E. Smith & G. E. Stone (Eds.), Criterion referenced testing: Practice analysis to score reporting using Rasch measurement (pp. 194-235). Maple Grove, MN: JAM Press.
Sechrest, L. (2005). Validity of measures is no simple matter. Health Research and Educational Trust, 40, 1584-1604.
Southwood, M. H., & Flege, J. E. (1999). Scaling foreign accent: Direct magnitude estimation versus interval scaling. Clinical Linguistics & Phonetics, 13, 335-349.
Stone, G. (1995a). Objective standard setting (Doctoral dissertation, The University of Chicago). (UMI No. 9618495)
Stone, G. (1995b). Objective standard setting. Paper presented at the national meeting of the American Educational Research Association, San Francisco, CA.
Stone, G. (1996a). Informing objectively derived criterion standards: No more smoke and mirrors. Paper presented at the national meeting of the American Educational Research Association, New York, NY.
Stone, G. (1996b, April). The construction of meaning: Replicating objectively derived standards. Paper presented at the national meeting of the American Educational Research Association, New York, NY.
Stone, G. (2001). Objective standard setting (or truth in advertising). Journal of Applied Measurement, 1, 187-201.
Stone, G. (2003). Setting mastery within objective standard setting. Rasch Measurement Transactions, 3, 919-920.
Stone, G. (2004). Objective standard setting. In R. Smith & E. Smith (Eds.), Understanding Rasch measurement. Maple Grove, MN: JAM Press.
Tenopyr, M. L. (1977). Content-construct confusion. Personnel Psychology, 30, 47-54.
Wright, B. D., & Gross, M. (1970). Setting performance standards. Unpublished manuscript.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago, IL: MESA Press.
Zraick, R. I., & Liss, J. M. (2000). A comparison of equal-appearing interval scaling and direct magnitude estimation of nasal voice quality. Journal of Speech, Language, and Hearing Research, 43, 979-988.
