Oller, J. W., Jr. (2012). Grounding the argument-based framework for validating score interpretations and uses. Language Testing, 29(1), 29-36. (This is a prepublication copy per ROMEO green classification by the publisher, Sage, published version available at http://ltj.sagepub.com/content/29/1/29.)

Grounding the argument-based framework for validating score interpretations and uses¹

John W. Oller, Jr.
University of Louisiana, USA

Abstract
Kane’s argument-based framework is summarized and examined. He implicitly appeals to the backgrounded concepts of fairness and justice. From there it is a short distance to grounding the whole system in the mundane notion of truth. In fact, valid argument systems must depend on representations that are ‘true’ by virtue of agreement with purported facts. As a friendly amendment, therefore, I argue that (provided the ceteris paribus, all else being equal, requirement is met) agreement with known facts in testing, experimental research, and scientific measurement counts for a great deal more than disagreement. It follows by Peircean ‘exact logic’ that higher test scores (if the tests have any validity at all) are invariably more informative (interpretable in general) and thus more useful than lower scores. Why? Because higher scores show more agreement between the test-makers and the higher scoring test-takers about whatever facts (or performances) may be at issue. Exceptions are cases where the ceteris paribus requirement is not met. Necessary (but testable) inferences follow for interpretations and uses of ‘cut scores.’

Keywords
argument-based framework, connectedness, cut scores, episodic organization, meaningful sequence, pragmatic grounding, true narrative representations, TNR theory, validity

Let me begin by summing up what I think are the critical points of Kane’s broad and practical approach to validating score interpretations and uses. Then, I will focus on what I believe are the underlying principles which give the framework its substance and, finally, I will propose a friendly amendment with respect to cut scores.

The argument-based framework

Kane’s purpose, as I understand it, was to explain his two-step argument-based framework (Kane, 1992). That framework (generalized) aims first to detail how test scores are to be interpreted and what decisions are to be based on them, and, second, to assess the ‘plausibility of the proposed interpretations and uses’ (p. 1). The term ‘plausibility’ nuances any claim of ‘truth’ about facts that can be independently examined, but, and I think Michael Kane may agree here, to keep ‘plausibility’ arguments from devolving into mere voting or opinion surveys, it is essential that the implicit agreement be about independently verifiable facts. Thus, if ‘plausibility’ is to have its usual sense in the sciences, it must be about more than the mere consensus of persons. It has to get down to the material world, which is, however it is, independent to a considerable extent of what may be claimed or thought about it.

That said, the down-to-earth approach elaborated by Kane draws on many excellent sources, consolidating and extending a practical conception of ‘validity’ in a way that is reminiscent of experimental hypothesis testing in the sciences on the one side and legal contests in a courtroom on the other. Kane’s summary of the history of the definition and assessment of validity is masterful, informative, and engaging. The essential thesis of the whole is charmingly captured in the conversation between Alice of Wonderland and the unflappable Humpty Dumpty. Alice’s naivete is counterbalanced by Humpty’s flawed sagacity. How, indeed, could ‘impenetrability’ mean that she should move on to another subject? But Humpty Dumpty justifies the strange use of the word by suggesting the very thesis that Kane presents for consideration: Whenever we want to make a theory, measurement, or test do more work, we are obligated to provide stronger evidence in its favor.

Figure 1 of Kane’s paper (p. 000) expresses the essence of the argument-based framework in the formula from Toulmin (1958): Given certain ‘backing’ and a ‘warrant’ (a system of theoretical arguments) about a certain ‘Datum’ (a body of facts placed in evidence), and provided certain requirements are met (what Toulmin refers to as the ‘Qualifier’), we justify a ‘Claim’ (a conclusion or warranted inference), though there may be certain ‘exceptions’ (which should be iterated). Kane sums up:

The specification of an interpretive argument puts a definite proposal on the table. A critic gets to challenge the proposal, and a proponent has to defend it. The process is akin to theory testing in science, with the interpretive argument playing the role of the theory under investigation. The assessment developer is expected to make a positive case for a proposed interpretation/use by stating the interpretive argument clearly, by demonstrating its coherence, and by providing support for its inferences and assumptions. A critic can challenge the appropriateness of the proposed interpretations and uses of the assessment results, the adequacy of the interpretive argument given the goals of the assessment program, or the plausibility of specific inferences or assumptions.
A critic can also challenge a proposed use of assessment scores even if the interpretation of the scores is generally justified (Shepard, 1993), because, as noted earlier, score uses that ‘impinge on the rights and life chances of individuals are inherently disputable’ (Cronbach, 1988, p. 6). (p. 000)

I suppose that argument-based validation can be seen as a discursive form of hypothesis testing. Kane relates his argument-based approach to institutional testing and test scores in general. In directing discussion toward language testers, Kane shows how his argument-based framework applies when we are thinking of ‘cut scores’ (e.g. in applications of TOEFL) which are to be justified by ‘performance’ or ‘judgmental standard-setting studies’ (p. 000). He explains that

the cut score corresponds to the performance standard, in the sense that persons with scores above the cut score have generally achieved the performance standard, and that persons with scores below the cut score have generally not achieved the performance standard. The backing for the warrant needs to support these two assumptions. (p. 000)

The ‘standard-setting studies’ need to justify the categorization of persons above and below the cut score, and thus to render plausible their use, for example, in college admissions, which might have a lower cut point than, say, qualifying to be a teaching assistant. Similarly, a series of cut points at successive levels might be applied for exemption from or placement in some level of coursework designed to teach English as a second language. Many other uses of institutional scores can be conceived, but it is clear that the argument-based framework is applicable to them all, with the caveat that its conclusions need to incorporate a rational consideration of exceptions per the Toulmin formula.

Pragmatic grounding

Realizing that Kane sees his argument as ‘pragmatic’ in the sense of its being down to earth, sensible, and practical, and given also that he has embraced part or all of the best traditions of measurement theory from many contributors, it is with admiration for the excellent work he has digested and summarized so masterfully that I want to discuss grounding the framework in certain indefeasible formal arguments. The formal grounding also shows the argument-based framework as complementary to what otherwise might be mistaken for a competing system of thought. Borsboom, Mellenbergh, and van Heerden (2004) argued that ‘a test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variations in the attribute causally produce variations in the outcomes of the measurement procedure’ (p. 1061). They oppose defining validity with reference to

whether the empirical relations between test scores match theoretical relations in a nomological network (Cronbach & Meehl, 1955), and . . . whether interpretations and actions based on test scores are justified – not only in the light of scientific evidence but with respect to social and ethical consequences of test use (Messick, 1989). (Borsboom et al., 2004, p. 1061)

In 2005, we proposed a resolution of the apparent differences (Badon, Oller, Yan, & Oller, 2005, retrieved July 27, 2011, from http://journals.tc-library.org/index.php/tesol/article/view/73). We contended that ‘truth’ in its simplest and most mundane sense is critical to any valid notion of validity.
With that in mind, Xi’s argument for ‘the integration of fairness into validity’ is inevitable. Xi argues that ‘impartiality and justice of actions and comparability of test consequences are at the core of fairness’ (2010, p. 167).

Grounding validity

Proofs in mathematical logic (Peirce, 1897; Tarski, [1936]/1956, [1944]/1949) show that all meaningful sign systems are initially grounded in true representations. In its least burdened sense, ‘truth’ of the requisite sort (also argued on an ‘exact logic’ Peircean basis, Oller, 1996, 2005) is found only in true narrative representations (TNRs). A TNR is merely the sort of representation that is as true as it purports to be of whatever facts it purports to represent.

The gist of the formal proofs can be illustrated this way. Suppose someone says, ‘I doubt X,’ where X is filled in any way we like, for example ‘that any TNRs exist’ or, if Jacques Derrida were the speaker, he might be saying, ‘I doubt that I exist.’ Descartes might say, ‘I doubt that I can think without being’ and so forth. Granting only that some such statement is intelligible at least in self-reference by the speaker, it comes out that any such statement contains a TNR. Descartes could have amplified his well-known aphorism, ‘I think, therefore, I am’ into a fully satisfactory deduction by saying, ‘Every valid self-reference is a TNR. I am speaking. Therefore, this statement is a TNR.’ Thus, the existence of TNRs is proved. To build any argument against their existence requires instantiating at least one exemplar. All such arguments fail by strict deductive logic more certainly than that the numbers 1, 2, 3, and so on, contain the number 1. The latter proof, by induction, involves an iteration that cannot be completed, but the proof of the existence of TNRs is deductive and complete. Reject their existence and all meaningful signs and all reasoning are lost. Also, TNRs provide the only escape from ‘infinite regress or circularity’ (both ‘endless processes’ to borrow a few words from Kane and Cronbach).

However, by admitting TNRs, we also discover the amazing realm of their unique logical properties. No matter how many gradations of fictions, errors, lies, and nonsense may be defined in addition to TNRs, it follows by strict mathematical logic that only TNRs determine particular facts in the material world and are thus connected through that world with each other, and are also generalizable to all similar contexts of experience exactly to the limit of the similarities of their facts. A TNR need not represent all possible details of the facts it is about, but it must agree with the facts it purports to represent to the full extent of their conventional meanings. For instance, if I assert (in a TNR) that there is a computer here on the desk in front of me, I am not obliged to report all the other objects, the time of day, what I had for breakfast, and so forth, for my statement to be a TNR, and nothing in the proofs hinges on the selection of any particular TNR.
All else being equal, any TNR is logically more in agreement with whatever facts it purports to be about than (a) any fiction (which purports to represent at least some non-factual content; for example, suppose I imagine an elephant where my computer sits on the desk); (b) any error (which is a fictional representation mistaken for a TNR; for example, suppose someone thinks there is an elephant where there is in fact a computer instead); or (c) any lie (suppose a known murderer denies being at the scene of the crime). By strictly formal comparisons (all else being held equal) it can be rigorously proved that TNRs are the only systems of representation (among all that are possible) that are relatively perfected: (1) in singling out particular facts in the material world; (2) connecting representations validly with the material world and thus with each other by inference; and (3) generalizing to all similar facts up to the limit of the similarities.

The proofs of the uniqueness of these logical properties of TNRs can be developed in a variety of ways. They flow seamlessly from the fact that only TNRs (all else being equal) consist of relatively complete three-part argument systems. TNRs consist of (i) conventional symbols mapped appropriately (i.e. according to the requisite conventions) through (ii) particular actions (indexes) onto (iii) factual states of affairs or sequences of events in the material world. At their basis, TNRs connect through the sensory experience of one or more competent observers with the material world. The distinctive character of TNRs resides in the fact that they alone (by contrast with fictions, errors, lies, and nonsense) possess relatively complete agreement between all three of their essential component systems. A skeptic might doubt that we have considered all possible representational systems. But we have, more certainly than the fact that the number 2, as in 2 apples, contains the number 1, as in 1 apple, and so forth.
Meaningful fictions can only be derived from TNRs, errors from intelligible fictions mistaken for TNRs (but only known to be errors to the extent they can be shown to be imperfect by comparison with TNRs), and lies are errors deliberately misrepresented to be TNRs. For lies to work, they must resemble TNRs. As for nonsense, it exists only by virtue of its superficial resemblance to conventional representational systems which are acquired and get their distinctive valences from TNRs. If, for example, to pick out an arbitrary contrast, say, the /t/ of ‘coat’ were not distinct from the /k/ of ‘coke’ (and any other contrast would serve as well) as it is in many TNRs, the contrastive values of the phonemes themselves would be undiscoverable. Finally, because any system of representation more complete than any given TNR could only be another TNR, and because any system more incomplete than utter nonsense would be no system of representation at all, it follows that the proofs concerning the unique logical properties of TNRs are complete. They cover all possible sign systems.

From the connectedness of ordinary TNRs, transitive inferences are justified. For instance, suppose someone truthfully reports getting up in the morning. We are justified in supposing that the night has passed and that in a few hours it will be afternoon, and so forth. What is the warrant for such inferences? If the report of getting up in the morning is merely true with respect to the facts that it purports to be about, then the rest follows from the connectedness of TNRs. Indeed, with respect to temporal sequences of events, it has been strictly proved that the simplest possible narratives are ones that conform to chronological organization. Moreover, as demonstrated in many different empirical contexts, the ecological validity of language tests and measures of discourse processing in general is enhanced by respecting narrative-like episodic organization, and it is reduced by disrupting it (Oller, Chen, Oller, & Pan, 2005). In asking why, we discover a proof that valid measurements of any kind depend on the episodic organization of ordinary experience. Details are given in Oller and Chen (2007).
Succinctly put, no valid nomination (naming) of any particular identity (say, an individual person, score on a test, the occasion of testing, the test booklet, an entry in a spreadsheet, or a sample of similar entries, or whatever nomination of particulars we might select) can be achieved apart from the episodic organization of experience. The same argument holds not only for the nomination of some logical object (a grammatical argument), but also for any predicate of any degree of complexity or abstractness that might be associated with it. A strict proof is produced by showing that the distinct entity to be named (together with any valid predicates) cannot be distinguished from similar (but distinct) entities in less than a four-dimensional manifold. A timeless representation in space cannot distinguish similars (e.g. consider replicas which are almost completely identical objects or events, twins, cattle, ships, cars, explosions, etc.) well enough to determine the particular identity of any one object (or person) to be named (as in what Stevens, 1968, called a ‘nominal’ scale). Without some knowledge of their histories over time, very similar objects cannot be validly distinguished. Leave time out entirely and it will be impossible to tell whether Michael Kane has merely changed his position in space or whether he has ever so many identical clone replicas. Although Michael might appear different from John Oller across distinct locations, for any observer to look back and forth between any two occasions or locations also requires time. Without it, John and Michael can neither be compared nor distinguished, and, thus, identity (valid nomination) cannot be determined. But, add the dimension of time into the world of facts and their representations, as in TNRs, and valid nomination (together with resolution of associated abstract and complex predicates) becomes possible.
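The point about time and identity can be made concrete with a small sketch. It is only an illustration under stated assumptions, not part of the formal proof referred to above: look-alike objects are reduced to sightings at named locations, the function names are hypothetical, and the only physical assumption is that one object cannot occupy two locations at the same moment.

```python
from collections import defaultdict


def possible_counts_without_time(locations):
    """Timeless view: only the set of places where a look-alike was seen.

    With no temporal information, anything from one object (that moved)
    up to one object per location is consistent with the sightings.
    """
    distinct = len(set(locations))
    return range(1, distinct + 1)


def min_objects_with_time(sightings):
    """Time-stamped view: sightings are (time, location) pairs.

    Assumption (ours, for illustration): a single object cannot be in
    two locations at the same time, so simultaneous sightings in
    different places force distinct identities.
    """
    places_at = defaultdict(set)
    for time, location in sightings:
        places_at[time].add(location)
    return max((len(places) for places in places_at.values()), default=0)


# Two indistinguishable sightings, one at the door and one at the desk.
timeless = possible_counts_without_time(["door", "desk"])
moved = min_objects_with_time([(0, "door"), (1, "desk")])
replicas = min_objects_with_time([(0, "door"), (0, "desk")])

print(list(timeless))  # one object or two: undecidable without time
print(moved)           # sequential sightings: one object suffices
print(replicas)        # simultaneous sightings: at least two objects
```

In this toy setting the timeless record cannot decide between one object that changed position and several replicas, whereas the time-stamped record can at least rule out a single object when two sightings coincide.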
Next, to complete the proof for all possible scales of greater complexity than nominal ones, we first show that the dependence of nominal validity on valid episodic organization generalizes to all possible ordinal scales (Stevens, 1968). For instance, it generalizes to sequences such as 1, 2, 3, . . . ; A, B, C, . . . ; or to any ordinal scale. To set the elements of any such scale in the required transitive relation (a ranked order) such as is found in a temporal sequence, or, say, in a left-to-right spatial arrangement, the differentiation of gradations depends on multiple valid nominal distinctions. Therefore, ordinal scales require episodic organization for the same reasons already iterated for nominal ones. Finally, we show that the same requirements for nominal and ordinal scales generalize to interval, ratio, and all possible higher scales, and thus to all measures with the slightest claim to validity whatsoever, and the proof is complete.

It hardly seems necessary to mention that agreement between representations and facts is the essence of ordinary truth, but a couple of not so obvious inferences follow for cut scores in testing and for experimental results in the sciences. These generalize to validation studies of score interpretations and their uses. Consider the difficulty of mapping an arbitrary conventional symbol onto a particular entity in a universe populated by uncountable multitudes of distinct entities that might be singled out to be identified (uniquely determined) in the manner of a TNR. In such an economy, is a hit the same as a miss? The absurdity of regarding misses as having the same probative weight as hits is exactly reflected in regarding low scores on a par with high scores, failed theoretical predictions on a par with successful ones, and disagreements on a par with agreements. Agreement between a representation and its facts, as in a TNR, is worth incomparably more than ever so many disagreements. Thus, ordinary true reports (valid predictions included) involve vastly more interpretability and usefulness than false reports (and than failed hypotheses).
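One way to make the asymmetry between hits and misses concrete is an information-theoretic sketch. It is offered only as an illustration and is not the Peircean argument itself: it assumes, hypothetically, a finite set of n equally likely candidate entities, exactly one of which is the intended one, and it measures informativeness in Shannon's bits. The function names are invented for this example.

```python
import math


def bits_from_hit(n_candidates):
    """Information gained when a guess singles out the one right entity.

    Assumes n_candidates equally likely alternatives (n >= 2); a hit
    reduces the possibilities from n to 1, i.e. log2(n) bits.
    """
    if n_candidates < 2:
        raise ValueError("need at least two candidates")
    return math.log2(n_candidates)


def bits_from_miss(n_candidates):
    """Information gained when a guess is wrong.

    A miss only rules out the one entity guessed, leaving n - 1
    possibilities, i.e. log2(n / (n - 1)) bits.
    """
    if n_candidates < 2:
        raise ValueError("need at least two candidates")
    return math.log2(n_candidates / (n_candidates - 1))


# Compare a hit and a miss as the field of candidates grows.
for n in (2, 4, 1000):
    print(n, round(bits_from_hit(n), 3), round(bits_from_miss(n), 3))
```

Under these assumptions a hit and a miss carry equal information only when there are two candidates. As the number of candidates grows, the information in a hit grows without bound while the information in a miss shrinks toward zero, which is in the spirit of the claim that agreement counts for more than disagreement.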
It follows by mathematical logic of the strictest kind that in testing, especially in measurements that depend on valid mapping of conventional linguistic signs onto intended meanings, or onto known or inferred facts, provided the ceteris paribus requirement is satisfied, scores above a cut point are more informative than ones below it. Likewise, tests, theories, and arguments in general that predict unlikely outcomes correctly are incommensurably more informative and useful than similar efforts leading to results expected by chance (null outcomes). Higher scores and valid predictions, to use the words of Humpty Dumpty, pay better wages than lower scores and invalid predictions. Also, the advantage rises toward a limit of relatively perfect reliability and validity with increases in agreement. Although this inference is not stated in Kane’s insightful paper, it may be mentioned elsewhere in his work.

In conclusion

When Charles Stansfield observed in his lifetime award lecture that back in the 1970s I had ‘focused on the test as a whole, and attempted to define the construct that it measured,’ he pointed toward inferences grounded in ‘episodic organization.’ Such narrative-like transitivity is also found in experimental reports, conversational exchanges, debates and legal argumentation, and in experience in general. It connects the argument-based validation framework on the one hand with particular facts, the concrete side of TNRs, and, on the other, with generalizable representations, the abstract side of TNRs. It connects all coherent forms of discourse including fictions, errors, and lies exactly to the extent that any of them have any ‘plausibility.’

The connectedness of ordinary experience was stressed by John Dewey and before him by C. S. Peirce. My dad called it a ‘meaningful sequence’ as exemplified in his El español por el mundo published by Encyclopedia Britannica Films (Oller, Sr., 1963–1967). The storyline of that filmed series is grounded in the world of La familia Fernandez. At number 54 of the first level of the 101 filmed episodes, Alvaro Fernandez receives a letter from his cousin who is planning a visit to Mexico (see this episode at the Academic Film Archive of North America, retrieved July 27, 2011, from http://www.archive.org/details/OtraCarta). In the 27 following episodes (Emilio en Espana), Alvaro’s brother Emilio travels to Spain to visit grandparents and relatives. Then, in the third series of 20 additional filmed episodes, Emilio and cousin Paco experience famous Spanish sites and cultural events in Coloquios Culturales. It was from those episodically organized teaching materials that my journey as a linguist, theoretician, and language teacher/tester began. Since then it has been possible to demonstrate that episodic organization is essential to language acquisition, teaching, and testing. It appears, in fact, to be essential to valid measurement in general and therefore also to the argument-based framework.

Note

1. Michael Kane’s ‘Messick Lecture,’ titled ‘Validating Score Interpretations and Uses,’ was delivered orally at the Language Testing Research Conference in Cambridge in April 2010 and appears in this journal under the same title.

References

Badon, L. C., Oller, S. D., Yan, R., & Oller, J. W., Jr. (2005). Gating walls and bridging gaps: Validity in language teaching, learning, and assessment. Teachers College, Columbia University Working Papers in TESOL Applied Linguistics, 5(1), 1–15.
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Kane, M. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535.
Kane, M. (2011). Validating score interpretations and uses. Language Testing, ?(?), 000–000.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement, 3rd ed. (pp. 13–103). New York: American Council on Education and Macmillan.
Oller, J. W., Jr. (1996). How grammatical relations are determined. In B. Hoffer (Ed.), The 22nd Linguistic Association of Canada and the United States (LACUS) forum, 1995 (pp. 37–88). Chapel Hill, NC: Linguistic Association of Canada and the United States (Series Ed., T. Griffen).
Oller, J. W., Jr. (2005). Common ground between form and content: The pragmatic solution to the bootstrapping problem. Modern Language Journal, 89, 92–114.
Oller, J. W., Jr., & Chen, L. (2007). Episodic organization in discourse and valid measurement in the sciences. Journal of Quantitative Linguistics, 14, 127–144.
Oller, J. W., Jr., Chen, L., Oller, S. D., & Pan, N. (2005). Empirical predictions from a general theory of signs. Discourse Processes, 40(2), 115–144.
Oller, J. W., Sr. (1963–1967). El español por el mundo (La Familia Fernandez, Primer Nivel; Emilio en Espana, Segundo Nivel; Coloquios Culturales). Chicago: Encyclopedia Britannica Films.
Peirce, C. S. (1897). The logic of relatives. The Monist, 7, 161–217. Also in C. Hartshorne & P. Weiss (Eds.), (1932), Collected papers of C. S. Peirce, Vol. 2 (pp. 288–345). Cambridge, MA: Harvard University Press.
Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of Research in Education, Vol. 19 (pp. 405–450). Washington, DC: American Educational Research Association.
Stansfield, C. W. (2008). Where we have been and where we should go. Language Testing, 25(3), 311–326.
Stevens, S. S. (1968). Measurement, statistics, and the schemapiric view. Science, 161(3844), 849–856.
Tarski, A. (1949). The semantic conception of truth. In H. Feigl & W. Sellars (Eds. and Trans.), Readings in philosophical analysis (pp. 341–374). New York: Appleton. (Original work published 1944)
Tarski, A. (1956). The concept of truth in formalized languages. In J. J. Woodger (Ed. and Trans.), Logic, semantics, and metamathematics (pp. 152–278). Oxford: Oxford University Press. (Original work published 1936)
Toulmin, S. (1958). The uses of argument. Cambridge: Cambridge University Press.
Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27(2), 147–170.
