Psychological Methods, 2013, Vol. 18, No. 3. Advance online publication, July 8, 2013. © 2013 American Psychological Association. doi: 10.1037/a0032969

Standards for Talking and Thinking About Validity

Paul E. Newton
Institute of Education, University of London

Stuart D. Shaw
Cambridge International Examinations, Cambridge, England

Standards for talking and thinking about validity have been promulgated in North America for decades. In 1954 two foundational standards were announced: (a) Thou shalt not refer to "the validity of the test" and (b) thou shalt use validity modifier labels, such as "content validity" or "predictive validity." Subsequently, in 1985, the latter became, thou shalt not use validity modifier labels. These standards for talking about validity have repeatedly been disregarded over the years. Possible reasons include intentional misuse, while upholding standards for thinking about validity; lack of awareness or misunderstanding of standards for thinking about validity; and genuine divergence from standards for thinking about validity. A historical analysis of disregard for these standards provides a basis for reappraising the concept of validity. We amassed a new body of evidence with which to challenge the frequently asserted claim that a general consensus exists over the meaning of validity. Indeed, the historical analysis provides reason to believe that prospects for achieving consensus over the meaning of validity are low. We recommend that the concept of validity be abandoned in favor of the more general, all-encompassing concept of quality, to be judged in relation to measurement aims, decision making aims, and broader policy aims, respectively.

Keywords: validity, quality, evaluation, validation, test, assessment

Paul E. Newton, Department of Curriculum, Pedagogy and Assessment, Institute of Education, University of London, London, England; Stuart D. Shaw, Cambridge International Examinations, Cambridge, England. We are very grateful to Cambridge Assessment (University of Cambridge Local Examinations Syndicate, which includes Cambridge International Examinations) for supporting the preparation of this article. Correspondence concerning this article should be addressed to Paul E. Newton, Department of Curriculum, Pedagogy and Assessment, Institute of Education, University of London, 20 Bedford Way, London WC1H 0AL, England. E-mail: [email protected]

The notion of "standards" at the heart of this discussion is intended to capture the idea of consensus, within a community, concerning how its members ought to behave. Within scientific communities, standards are often expressed implicitly (i.e., as paradigms through which knowledge is constructed). Within professional communities, standards are often expressed explicitly (i.e., as codes of practice). Whether implicit or explicit, standards are fundamental to communities because they enable individuals to function collectively (i.e., to function as communities). Since the 1950s, the American Psychological Association (APA), the American Educational Research Association (AERA), and the National Council on Measurement in Education (NCME) have collaborated in the development of standards for educational and psychological testing (known, hereafter, as successive editions of the Standards), with an intention "to promote the sound and ethical use of tests and to provide a basis for evaluating the quality of testing practices" (AERA, APA, & NCME, 1999, p. 1). Each edition has contained a consensus statement on validity, which has evolved over time as the field has developed. Each new edition is the product of debate between many subcommittees, representing many subcommunities, and takes many years to develop. The Standards are respected internationally, and conceptions of validity presented in successive editions have been appropriated internationally.

Increasingly, in recent years, writers have acknowledged substantial discrepancy between the principles of validity, embodied within these consensus statements, and validation practice evident from the wider literature (e.g., Cizek, Rosenberg, & Koons, 2008; Hogan & Agnello, 2004; Hubley & Zumbo, 1996; Jonson & Plake, 1998; Messick, 1988; Shepard, 1993; Wolming & Wikstrom, 2010). This raises an important question: If measurement specialists have genuinely reached consensus over the concept of validity, then why is there so little evidence of this in validation practice?

In the present article, we add to this literature by exploring an apparent disjunction between standards for talking about validity and how validity is actually talked about in the published literature (our use of "talking" includes written text). Our intention is to mark a subtle distinction between standards for talking about validity and standards for thinking about validity. As we will explain shortly, the Standards contain both specific standards for talking about validity and more general standards for thinking about validity. Standards for thinking about validity specify how it ought to be understood (i.e., the accepted meaning of the concept). Standards for talking about validity specify how it ought to be expressed or articulated. The latter clearly follow from the former. Indeed, the point of standards for talking about validity would seem to be to emphasize, or to underline, associated standards for thinking about validity. In short, scientists and professionals ought to talk properly about validity in order that they, and others, continue to think properly about validity.




The present article will focus upon two of the most fundamental standards for talking about validity, both derived directly from the Standards, which we refer to colloquially as:

1. Thou shalt not refer to "the validity of the test" (TVOTT), that is, as though validity were a property of tests.

2. Thou shalt (not) use validity modifier labels (VMLs), that is, terms like content validity and predictive validity (the not appears in parentheses because this standard was promoted by the first three editions yet rejected by the fourth and fifth).

These two standards are intimately, albeit confusingly, intertwined. We will demonstrate how they have repeatedly been disregarded, providing a basis for reflecting upon the desirability and viability of standards for thinking and talking about validity. We will conclude from our historical analysis that prospects for reaching consensus over the meaning of validity are low. This is epitomized by the fact that the field has been unable to reach agreement over whether the concept of validity ought to embrace the evaluation of measurement aims alone; the evaluation of measurement and decision making aims; or the evaluation of measurement, decision making, and broader testing policy aims. Our recommendation, faced with this enduring lack of consensus, is to abandon the concept of validity in favor of the broader concept of quality, applicable less contentiously across the three principal evaluation foci just mentioned.

Validity Means Different Things to Different Communities

This article is concerned with standards for talking about validity, as a point of focus for the more general issue of standards for thinking about, that is, for conceptualizing, validity. It therefore concerns what is meant by validity within a particular community. The community at the heart of this discussion is an extremely broad one: the supracommunity of educational and psychological measurement (EPM). It embraces scientists with a remit for measurement within academic settings and professionals with a remit for measurement within practical settings. It includes experimental psychologists, clinical psychologists, educational psychologists, guidance counselors, test developers, personnel psychologists, test regulators, and many more. Implicitly, the reach of this supracommunity is even broader, because it ought to extend to anyone, academic or practitioner, who relies upon measurement in an educational or psychological context. This would, for example, include many experimental psychologists who would not specifically consider themselves to be measurement scholars. It might also extend to those within other fields of social science research, where similar kinds of measurement procedure are relied upon. Although the more general relevance of this thesis should be appreciated, the article is framed in terms of the explicit development of standards by the mainstream EPM supracommunity.

Specifying this focus is important because standards of validity for EPM differ significantly from standards of validity across other communities of practice. For instance, within the community of formal logicians, validity refers to deductive arguments, such that an argument is valid if and only if it is not possible for all its premises to be true when its conclusion is false.
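In symbols (our gloss, not a notation the article relies upon): an argument with premises $P_1, \ldots, P_n$ and conclusion $C$ is deductively valid exactly when

\[
(P_1 \land P_2 \land \cdots \land P_n) \land \lnot C \ \text{is impossible},
\]

so that, for instance, "If it rains, the ground is wet; it rains; therefore the ground is wet" is valid whatever the weather actually does.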

Validity is defined differently across communities as disparate as law (e.g., Austin, 1832/1995; Waluchow, 2009), economics (e.g., MacPhail, 1998), pattern recognition (e.g., Halkidi, Batistakis, & Vazirgiannis, 2002), genetic testing (e.g., Holtzman & Watson, 1997), and management (e.g., Markus & Robey, 1980), to name but a few.

More confusingly, there are standards for validity within education and psychology that are not specific to measurement. This is to draw a distinction between validity for research and validity for measurement. The former is relevant whenever conclusions are to be drawn on the basis of research evidence. The latter is relevant only for conclusions that relate specifically to measurement. Validity for research has been theorized from both quantitative (e.g., Bracht & Glass, 1968; Campbell, 1957; Campbell & Stanley, 1966; Cook & Campbell, 1979) and qualitative (e.g., Kvale, 1995; Lather, 1986, 1993; Maxwell, 1992) perspectives.

Validity Standards

The end of the 19th century and the beginning of the 20th century witnessed a huge expansion in the science and practice of EPM, especially in the United States. Inevitably, questions were raised over the quality of some of the new developments, and committees exploring the need for greater standardization and control were established by the APA as early as 1895 (under Cattell) and 1906 (under Angell). As explained by Fernberger (1932), early attempts at control were largely unsuccessful.

Some years later, the Standardization Committee of the North American National Association of Directors of Educational Research surveyed its membership with the intention of establishing consensus on the kind of information that could demonstrate the superiority of one test over another. It provided tentative definitions for terms like scale, average, performance, and so on, and proposed a process for standardizing tests, which included the determination of both validity and reliability. It defined these operations thus:

    Two of the most important types of problem in measurement are those connected with the determination of what a test measures, and of how consistently it measures. The first should be called the problem of validity, the second, the problem of reliability. (Buckingham et al., 1921, p. 80)

Thirty years later, the APA took the lead again, establishing a Committee on Test Standards, to be chaired by Lee J. Cronbach. As explained in an initial draft prepared for consultation, it was given a remit to prepare "an official statement of the profession" concerning standards of reporting information about tests (see APA, 1952, p. 461). Importantly, its final draft, published 2 years later, was prepared by a joint committee of the APA, the AERA, and the National Council on Measurements Used in Education (NCMUE; APA, AERA, & NCMUE, 1954). This represented the very first consensus statement on such matters from the EPM supracommunity. It included sections on dissemination, interpretation, validity, reliability, administration and scoring, and scales and norms. The section on validity, within the first edition of the Standards, was by far the largest. It included an introductory text, within which validity was defined and explained (pp. 13–18), followed by 19 validity standards (pp. 18–28), the very first of which was a standard for talking about validity:

    When validity is reported, the manual should indicate clearly what type of validity is referred to. The unqualified term "validity" should be avoided unless its meaning is clear from the context. (APA et al., 1954, pp. 18–19)


In subsequent comments, it went on to add that no manual should use a blanket statement like "this test is valid," and this requirement was elaborated in the second edition:

    C1.1. Statements in the manual about validity should refer to the validity of particular interpretations or of particular types of decision. ESSENTIAL [Comment: It is incorrect to use the unqualified phrase "the validity of the test." No test is valid for all purposes or in all situations or for all groups of individuals.] (APA, AERA, & NCME, 1966, p. 15)

This brief comment appeared almost word for word in every subsequent edition of the Standards. Note that there was no hedging here: The standard was deemed essential and the expression "It is incorrect" was uncompromising. It is important to appreciate that this single standard for talking about validity spawned two quite distinct conventions: one that remains a fundamental standard for talking about validity even to the present day (thou shalt not refer to TVOTT) and one that is now seen as a relic of former times (thou shalt use VMLs). Because the story of the latter is less well documented than the former, the convoluted history of the VML will be discussed below in some detail, followed by a briefer and more straightforward account of why it is still considered inappropriate to refer to TVOTT.

Thou Shalt (Not) Use VMLs

A generally accepted principle of EPM, evident since at least the first few decades of the 20th century, was the idea that scores from a single test might be interpreted in different ways when used for different purposes (see Newton, 2012a). As explained in the first edition of the Standards, a vocabulary test might be interpreted as a measure of "present vocabulary" in one context, to make one kind of decision, but in terms of "intellectual capacity" in another, to make a different kind (APA et al., 1954, p. 13). It was for precisely this reason that the very first standard for talking about validity insisted that test manuals should clearly mark distinctions between different types of validity. Four types of validity were proposed, which mapped onto four aims of testing, which involved four types of interpretation. The four aims of testing were (a) to determine how an individual would perform at present in a given universe of situations (content validity), (b) to predict an individual's future performance on an external variable (predictive validity), (c) to estimate an individual's present status on an external variable (concurrent validity), and (d) to infer the degree to which an individual possesses a trait (construct validity). Thus, for example, content validation would be required in order to defend an interpretation in terms of present vocabulary, whereas construct validation would be required in order to defend an interpretation in terms of intellectual capacity. The first validity standard therefore required the use of VMLs in order to make explicit the kind of interpretation that had been validated. Although VMLs had appeared frequently in the literature on EPM since at least the 1930s, this new usage was somewhat different and somewhat more significant, as we shall now explain. We shall discuss the use of VMLs, in relation to the Standards, within three phases.


1930s to 1953. The use of VMLs can be found in the literature as early as the 1930s. Watson and Forlano (1935), for instance, spoke of prima facie validity; Woody and others (1935) referred to curricular validity; and Richardson (1936) discussed differential validity. A decade later, the concept of face validity was considered in some depth by both Rulon (1946) and Mosier (1947). Perhaps the first scholars to have used VMLs to deconstruct the concept of validity were Greene, Jorgensen, and Gerberich (1943), who distinguished between three kinds of validity: curricular validity, statistical validity, and psychological and logical validity. Guilford (1946) cut the validity cake somewhat differently, suggesting that it came in two kinds: factorial validity and practical validity. Cronbach (1949), in the first edition of his classic textbook, Essentials of Psychological Testing, distinguished two "basic approaches" based upon logical and empirical analysis. In that same edition, he referred not only to empirical validity and logical validity, but also to factorial validity and curricular validity. It is worth noting that early classifications using the VML formulation tended not to draw a clear distinction between different kinds of validity and different approaches to validation. For instance, Greene et al. (1943) referred to their three categories as both "types of test validity" (p. 54) and "types of methods" (p. 55).

1954 to 1984. The use of VMLs was formalized through the work of the committee that developed the first edition of the Standards (APA et al., 1954). The committee identified four types of validity: predictive, concurrent, content, and construct. From first to second edition, predictive validity and concurrent validity were combined within a single category: criterion-related validity. The first three editions presented somewhat mixed messages concerning the nature of validity. All three referred both to "types" and to "aspects" when describing their VMLs; the former suggesting fairly sharp dividing lines, and the latter suggesting the converse. It seems fair to conclude, however, that the first three editions of the Standards were generally read to be describing types rather than aspects (see, e.g., Guion, 1980). This seems consistent with the idea that different approaches to validation were required for different kinds of interpretation.

1985 to present day. It was against this fragmented view of validity that Messick (1975) championed a revolution. He insisted that talking about different kinds of validity, and marking such distinctions through the use of VMLs, was extremely misleading and had the potential to impact adversely upon validation practice. As he explained in two influential articles (Messick, 1980, 1981), important distinctions might become blunted, meaning that superficially similar categories are confused (e.g., content validity and construct validity), leading to confusion in evidence gathering; uniqueness might become elevated, such that one kind of validity (e.g., content validity), or a small set, might be treated as the whole of validity; and differences in importance might be overlooked, especially the supporting role played by content and criterion concerns to construct validation. The fourth edition of the Standards (AERA, APA, & NCME, 1985) was clearly influenced by Messick.
It stated explicitly that validity was a unitary concept, and although it did not formulate the rejection of the VML as an explicit validity standard, its decision to refer to content-related, criterion-related, and construct-related evidence of validity established that standard implicitly.


Both the fourth and fifth editions distinguished clearly between aspects and types of validity. They accepted that different kinds of evidence illuminated different aspects of validity, but insisted that the different kinds of evidence were not linked to different types of validity because there was now only one type of validity (i.e., construct validity). This was the foundation of a new creed for the EPM supracommunity: modern (Unitarian) validity theory, we might say, as opposed to traditional (Trinitarian) validity theory. The definition of validity in the fifth edition of the Standards was essentially an homage to Messick (1989). It reflected not only the depth and sophistication of his thesis, but also his occasional confusion (Newton, 2012a). It dropped the traditional three labels entirely and referred instead to evidence based upon test content, response processes, internal structure, relations to other variables, and consequences of testing (AERA et al., 1999). Its glossary noted that because all validity is essentially construct validity, even the modifier construct was now redundant. Thus, the use of VMLs was officially abandoned, and a fragmented conception of validity was officially replaced by a unified one.

In summary, during the early years, prior to 1954, there were no official statements concerning the use of VMLs. From 1954 to 1984, there were explicit standards for using VMLs, and these were exemplified in the introductory text of the first three editions of the Standards. The first edition divided validity into four types, but only four types, reflecting the four aims of testing. The second and third editions collapsed these into just three, which were deemed sufficient to cover the full range of possible interpretations of test scores. From 1985 the Standards recognized only one kind of validity, meaning that VMLs were officially rejected. As explained in the glossary of the fifth edition, the only VML with any remaining claim to legitimacy was construct validity, yet even this label was now superfluous.

Thou Shalt Not Refer to the Validity of the Test

The use of VMLs followed from the principle that conclusions concerning validity are never general but relate to specific interpretations. Thus, an interpretation of test scores in terms of present vocabulary might be valid, whereas an interpretation of the same test scores in terms of intellectual capacity might be invalid. In the same way, it was accepted that different conclusions concerning validity might follow for different groups of individuals, or for different situations within which individuals or groups found themselves. In short, it is never the test that is to be judged valid or invalid, in a general sense, but the interpretation of test scores as measures of a specific attribute under specified conditions.

Although the use of VMLs was officially rejected in the mid-1980s, the principle from which it was originally derived remained intact. Thus, it remained a fundamental tenet of modern validity theory that validity related to the interpretation of test scores and not to the test itself. If results from a single test were to be interpreted in terms of different attributes, then each interpretation would need to be validated independently. What changed was the assumption that different approaches were required to validate different kinds of interpretation: Modern validity theory decreed that construct validation was required for all interpretations. In short, consensus over the inappropriateness of referring to TVOTT was never shaken.

The Importance of Consensus

The idea of the Standards as a consensus position was fundamental from the outset, and each new edition reaffirmed this principle. The explicit foci for consensus were, presumably, the standards themselves, although, by implication, it seems reasonable to conclude that consensus was also reached on the introductory text that accompanied each section, which elaborated points of principle from which the standards were derived. Although the fourth edition said of the introductory text that it should not be interpreted as imposing additional standards, it seems hard to avoid the conclusion that in promulgating a particular view of validity, standards for thinking about validity were established just as much by the introductory text as by the validity standards themselves. Note, for instance, the uncompromising style adopted by successive editions, illustrated in the opening sentences of successive validity sections:

    Validity information indicates to the test user the degree to which the test is capable of achieving certain aims. (APA et al., 1954, p. 13)

    Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. (AERA et al., 1999, p. 9)

There is no hint here that the views expressed might represent a tentative consensus or a compromise position, or even that there might be any doubt at all over their legitimacy. No such hints are to be found in other passages or other editions. The Standards therefore present explicit (practical) standards, including standards for talking about validity, prefaced by more implicit (conceptual) standards for thinking about validity.

Over the past couple of decades, the claim that there now exists a new consensus over the nature of validity, embodied in the fourth and, especially, the fifth edition, has repeatedly been asserted (e.g., Angoff, 1988; Cronbach, 1989; Downing, 2003; Dunnette, 1992; Kane, 2001; Shepard, 1993; Sireci, 2009). Moss (1995, p. 6) went so far as to describe this as "a close to universal consensus among validity theorists." This alleged consensus reasserts the traditional principle that it is wrong to refer to TVOTT because tests are not the kind of thing that can be valid or invalid. It also asserts that there is now only one kind of validity, construct validity, which renders the use of VMLs inappropriate.

Validity Custom and Practice

In response to this assertion of consensus, we now present a new body of evidence on the way in which VMLs have been used over the years, particularly during the two key phases identified above (pre- and post-1985). It highlights a disjunction between standards and custom and practice, that is, between how VMLs ought to have featured in the literature of EPM and how they actually did. This is followed by a much shorter section that simply highlights the more widely acknowledged fact that members of this supracommunity are still wont to refer to TVOTT. Possible reasons for these disjunctions are considered subsequently.


The Proliferation of VMLs Prior to 1985

In addition to those VMLs mentioned above, a range of different kinds of validity had been proposed even before the publication of the first edition of the Standards. These included intrinsic validity (Gulliksen, 1950); internal validity and external validity (Guttman, 1950); and synthetic validity, generalized validity, and situational validity (Lawshe, 1952). Others were proposed shortly afterward, including convergent validity and discriminant validity (see Campbell & Fiske, 1959); internal validity, substantive validity, structural validity, and external validity (see Loevinger, 1957); and trait validity and nomological validity (see Campbell, 1960).

Although the Standards were never intended as a textbook, it still seems a little odd that the early proliferation of VMLs was not explicitly recognized in the 1954 edition, let alone the 1966 revision. Indeed, although new types of validity continued to be introduced in the wake of the first edition, none was incorporated in the second or the third. Admittedly, a footnote to the 1974 revision did, at least, allude to developments within the wider literature:

    Many other terms have been used. Examples include synthetic validity, convergent validity, job-analytic validity, rational validity, and factorial validity. In general, such terms refer to specific procedures for evaluating validity rather than to new kinds of interpretative inferences. (APA, AERA, & NCME, 1974, p. 26)

This was not, strictly speaking, correct, though; for instance, even those VMLs listed in the footnote were not simply alternative procedures. Rational validity, for example, was more of an overarching category, akin to logical validity, with links to curricular validity and content validity. Then there were other well-known validities that were neither present on the list nor could properly be described as procedures, such as trait validity and nomological validity (Campbell, 1960), and incremental validity (Sechrest, 1963).

Inevitably, we would expect the publication of a statement that claimed to express an official statement of the professions to generate a certain amount of debate and divergent opinion within the wider literature. Not only did this occur, it resulted in the invention of a multiplicity of new VMLs. Cattell (1964, p. 7), for instance, bemoaned the "motley list of 'validity' terms" in the Standards. He claimed that in promulgating them, the committee had been unduly successful in establishing a professional consensus, given that the concept was still in its infancy. He argued that several existing uses were either unfruitful (e.g., construct validity) or superfluous (e.g., face validity, predictive validity, concurrent validity, content validity) in the sense of not being central to the concept of validity and better described using other terms. In search of a "more basic set of concepts," he proposed a suite of new VMLs along three dimensions: concrete validity to concept validity, natural validity to artifactual validity, and direct validity to indirect validity. His proposals had little impact on the wider literature. Nor did the plethora of VML-based taxonomies that were to follow.

Cureton (1965), for instance, distinguished between three kinds of criterion validity: raw validity, the correlation between a predictor measure and a (sui generis) criterion measure; true validity, the correlation between a predictor measure and estimated true scores on a (constructed) criterion measure; and intrinsic validity, the correlation between estimated true scores on a predictor measure and estimated true scores on a criterion measure, as formalized below.
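Under the usual assumptions of classical test theory, Cureton's three coefficients have a compact rendering; the notation below is our gloss, not Cureton's. Writing $r_{XY}$ for the observed predictor-criterion correlation, and $r_{XX'}$ and $r_{YY'}$ for the reliabilities of predictor and criterion, the second and third coefficients amount to the standard corrections for attenuation:

\[
\text{raw validity} = r_{XY}, \qquad
\text{true validity} = \frac{r_{XY}}{\sqrt{r_{YY'}}}, \qquad
\text{intrinsic validity} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}.
\]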

Lord and Novick (1968) drew a distinction between empirical validity and theoretical validity: empirical validity referring to the degree of association between the focal measurement and some other observable measurement, and theoretical validity referring to the correlation of an observed variable with a theoretical construct or latent variable, of which construct validity was a special case. Carver (1974) contrasted psychometric validity, concerning the identification of cross-sectional differences between individuals, with edumetric validity, concerning the identification of longitudinal changes within individuals over time. Popham (1978) proposed three new types of validity for criterion-referenced tests: descriptive validity, the extent to which the test measured what its descriptive scheme contended it measured; functional validity, the extent to which the test fulfilled its intended function; and domain-selection validity, the extent to which the behavioral domain was wisely chosen.

Beyond these alternative VML-based taxonomies, many new types of validity were proposed in the period between 1954 and 1984: for instance, domain validity (Tryon, 1957a, 1957b), common sense validity (Shaw & Linden, 1964, from English & English, 1958), occupational validity (Bemis, 1968), cash validity (Dick & Hagerty, 1971), single-group validity (Boehm, 1972), consensual validity (McCrae, 1982, from Rosenberg, 1979), decision validity (Hambleton, 1980), intrinsic rational validity and performance validity (Ebel, 1983), and so on.

Thus, despite very clear standards for talking about validity from 1954 to 1984, which recognized content validity, construct validity, and criterion-related validity but no other VMLs, a very large number of new VMLs came to be proposed. In fact, many of the "biggest hitters" of their day (Campbell, Loevinger, Cattell, Cureton, Lord, Novick, Carver, Popham, Tryon, Hambleton, Ebel, and many others too) contributed to this proliferation.

The Continued Proliferation of VMLs Following 1985

As we shall now demonstrate, the VML formulation continued to be used long after the fourth edition of the Standards had been published. In fact, new VML-based taxonomies and new VML types continued to be proposed too.

The continued use of VMLs in the wider literature. To investigate the use of VMLs in contemporary research reports, we analyzed titles of articles, from 22 journals within the field of EPM, that had been published between January 1, 2005, and December 31, 2010.1 This involved using the Internet search engine attached to the official website of each journal, restricted to the specified period, with validity in the title field. Titles including VMLs were exported for subsequent analysis. The intention was simply to count how many VMLs appeared in titles of articles from those journals, published between 2005 and 2010. Occasionally, more than one appeared in the same title, in which case all occurrences were counted.

1 Applied Measurement in Education; Applied Psychological Measurement; Assessment; Assessment and Evaluation in Higher Education; Assessment in Education: Principles, Policy and Practice; Educational and Psychological Measurement; Educational Assessment; Educational Assessment, Evaluation and Accountability; Educational Measurement: Issues and Practice; European Journal of Psychological Assessment; International Journal of Selection and Assessment; Journal of Applied Psychology; Journal of Educational Measurement; Journal of Personality Assessment; Journal of Psychoeducational Assessment; Language Assessment Quarterly; Language Testing; Measurement and Evaluation in Counseling and Development; Measurement in Physical Education and Exercise Science; Measurement: Interdisciplinary Research and Perspectives; Psychological Assessment; and Psychometrika.
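To make the tallying concrete, the counting step can be approximated with a short script along the following lines. This is an illustrative sketch only: the titles are invented, the exclusion list is abbreviated, and the actual journal-website searches were performed by hand rather than in code.

```python
import re
from collections import Counter

# Invented example titles; in the study, titles came from searches of
# 22 journal websites with "validity" in the title field (2005-2010).
titles = [
    "The Incremental Validity of a New Measure of Test Anxiety",
    "Construct Validity and Predictive Validity of Two Screening Scales",
    "On the Validity of Inferences From Test Scores",
]

# Referent modifiers (e.g., test, item, score) and simple relational
# modifiers (e.g., comparative, relative) were excluded from the
# published counts; the list here is abbreviated for illustration.
EXCLUDED = {"the", "test", "item", "score", "scale",
            "comparative", "relative", "maximum", "initial"}

vml_counts = Counter()
for title in titles:
    # Count every "<modifier> validity" pair; a title occasionally
    # contains more than one VML, and all occurrences are counted.
    for match in re.finditer(r"([A-Za-z-]+)\s+validity", title, re.IGNORECASE):
        modifier = match.group(1).lower()
        if modifier not in EXCLUDED:
            vml_counts[modifier] += 1

for label, count in vml_counts.most_common():
    print(f"{label}: {count}")  # incremental: 1, construct: 1, predictive: 1
```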


As a point of reference, it is useful to note that there were 131 titles that referred to validity but without any VML and a further 40 that referred to the validity without any VML. Table 1 presents results for 32 VMLs that appeared in titles from the 22 measurement journals between 2005 and 2010. For five journals, no titles including VMLs were identified.2 For seven journals, more than 20 titles including VMLs were identified.3 A total of 208 uses were identified, or 144 if construct and construct-related are omitted as allowable VMLs. Two kinds of VML were excluded from this analysis: referent modifiers and simple relational modifiers.4 In the top 13 (most frequently observed VMLs), each with at least two uses, we note one still officially sanctioned VML, construct(-related); four ex-officially sanctioned VMLs, criterion(-related), predictive, concurrent, and content; five very well worn VMLs, incremental, convergent, discriminant, factorial, and structural; and one new contender, consequential. Outside the top 13, we note a host of more obscure VMLs: some as old as the hills, such as differential, internal, and synthetic; others perhaps even making their maiden voyage, such as extratest, operational, and elemental.

It is tricky to interpret the significance of these results in isolation, so the same analysis was run for the period between January 1, 1975, and December 31, 1980. This period captured articles that would have been written during the pre-Unitarian phase: before Messick (1980) argued that the use of VMLs should be dropped, and before the Standards changed its nomenclature for validity. Unfortunately, there were far fewer measurement journals published back then, so the analysis was restricted to just three.5 The comparison of VML prevalence between 1975–1980 and 2005–2010 was complicated by the fact that validity was referred to more frequently in titles from the earlier years within these three journals. Thus, from 1975–1980, 56 articles referred to validity without mentioning a VML, compared to 31 from 2005–2010; likewise, an additional 41 referred to the validity without a VML from 1975–1980, compared to seven from 2005–2010.

Results for each of the three journals are presented in Table 2. For the journal Educational and Psychological Measurement, the picture seems to be one of a reduction in the use of VMLs over time, from 86 to 24. The picture is less clear for the Journal of Applied Psychology and the Journal of Personality Assessment, however, with the Journal of Applied Psychology remaining fairly stable (14 vs. 11) and the Journal of Personality Assessment rising (19 vs. 28). If construct validity and construct-related validity are excluded from these figures, they become 78 versus 14 (Educational and Psychological Measurement), 12 versus 10 (Journal of Applied Psychology), and 12 versus 17 (Journal of Personality Assessment).

In summary, although there may be some indication of a possible reduction in the use of VMLs over time, this evidence is not overwhelming, and many VMLs can still be found gracing the pages of the most respected measurement journals.

Table 1
The Prevalence of Validity Modifier Labels Within Recent Journal Articles

Label                            Frequency (n = 208)    %
Construct                        61                     29.3
Incremental                      27                     13.0
Predictive                       22                     10.6
Convergent                       17                     8.2
Discriminant                     14                     6.7
Criterion-related                12                     5.8
Concurrent                       9                      4.3
Criterion                        9                      4.3
Factorial                        8                      3.8
Construct-related                3                      1.4
Structural                       3                      1.4
Content                          2                      1.0
Consequential                    2                      1.0
Differential                     1                      0.5
Internal                         1                      0.5
Cross-cultural                   1                      0.5
Cross-                           1                      0.5
External                         1                      0.5
Population                       1                      0.5
Consensual                       1                      0.5
Diagnostic                       1                      0.5
Extratest                        1                      0.5
Incremental criterion-related    1                      0.5
Operational                      1                      0.5
Local                            1                      0.5
Concurrent criterion-related     1                      0.5
Criteria                         1                      0.5
Cross-age                        1                      0.5
Elemental                        1                      0.5
Predictive criterion-related     1                      0.5
Synthetic                        1                      0.5
Treatment                        1                      0.5

Once again, it is important to remember that these figures do not relate to the number of published articles that referred to VMLs; they relate simply to the number of articles with VMLs in their title. So the figures are a very conservative estimate of prevalence.

2 Applied Psychological Measurement; Educational Measurement: Issues and Practice; Journal of Educational Measurement; Psychometrika; and Language Assessment Quarterly.
3 Psychological Assessment; Journal of Personality Assessment; Assessment; Educational and Psychological Measurement; International Journal of Selection and Assessment; Journal of Psychoeducational Assessment; and European Journal of Psychological Assessment.
4 Certain VMLs, which we have called simple relational modifiers, are typically used simply to indicate the comparison of validities rather than to identify a particular way of thinking about validity. They include comparative, relative, maximum, initial, etc. The distinction between simple relational validities and more substantive ones was not always clear-cut: differential validity, for instance, is sometimes used in a simple relational way, but is frequently used in a more substantive manner; incremental validity has a relational component, but seemed sufficiently substantive to be included. Other VMLs, which we have called referent modifiers, relate simply to the referent of the validity claim. They include test, item, score, scale, assessment center, interviewer, questionnaire, instrument, measurement, argument, etc. (Item validity is sometimes used in a nonreferent way to capture the extent to which intended cognitive processes are elicited by an item.)
5 Educational and Psychological Measurement, Journal of Applied Psychology, and Journal of Personality Assessment.


Table 2
Comparative Prevalence of Validity Modifier Labels Over Time

1975–1980, n (%)

Label                            EdPM (n = 86)    JPA (n = 19)    JAP (n = 14)    Total (n = 119)
Construct                        8 (9.3)          7 (36.8)        2 (14.3)        17 (14.3)
Predictive                       27 (31.4)        1 (5.3)         0 (0.0)         28 (23.5)
Incremental                      2 (2.3)          0 (0.0)         0 (0.0)         2 (1.7)
Convergent                       3 (3.5)          1 (5.3)         0 (0.0)         4 (3.4)
Concurrent                       12 (14.0)        1 (5.3)         0 (0.0)         13 (10.9)
Discriminant                     7 (8.1)          3 (15.8)        0 (0.0)         10 (8.4)
Criterion-related                1 (1.2)          0 (0.0)         0 (0.0)         1 (0.8)
Factorial                        16 (18.6)        0 (0.0)         0 (0.0)         16 (13.4)
Internal                         1 (1.2)          0 (0.0)         0 (0.0)         1 (0.8)
Cross-cultural                   0 (0.0)          1 (5.3)         0 (0.0)         1 (0.8)
Differential                     0 (0.0)          0 (0.0)         7 (50.0)        7 (5.9)
Content                          2 (2.3)          1 (5.3)         1 (7.1)         4 (3.4)
Domain                           3 (3.5)          0 (0.0)         0 (0.0)         3 (2.5)
Single-group                     0 (0.0)          0 (0.0)         3 (21.4)        3 (2.5)
Diagnostic                       0 (0.0)          1 (5.3)         0 (0.0)         1 (0.8)
Concurrent criterion             1 (1.2)          0 (0.0)         0 (0.0)         1 (0.8)
Congruent                        1 (1.2)          0 (0.0)         0 (0.0)         1 (0.8)
Discriminative                   0 (0.0)          1 (5.3)         0 (0.0)         1 (0.8)
Edumetric                        1 (1.2)          0 (0.0)         0 (0.0)         1 (0.8)
Empirical                        1 (1.2)          0 (0.0)         0 (0.0)         1 (0.8)
Face                             0 (0.0)          1 (5.3)         0 (0.0)         1 (0.8)
Interpretative                   0 (0.0)          1 (5.3)         0 (0.0)         1 (0.8)
Job component                    0 (0.0)          0 (0.0)         1 (7.1)         1 (0.8)
(All remaining labels — Criterion, Construct-related, Cross-, External, Population, Consensual, Extratest, Incremental criterion-related, Operational, and Local — had zero occurrences in 1975–1980.)

2005–2010, n (%)

Label                            EdPM (n = 24)    JPA (n = 28)    JAP (n = 11)    Total (n = 63)
Construct                        9 (37.5)         11 (39.3)       0 (0.0)         20 (31.7)
Predictive                       3 (12.5)         1 (3.6)         3 (27.3)        7 (11.1)
Incremental                      1 (4.2)          4 (14.3)        1 (9.1)         6 (9.5)
Convergent                       1 (4.2)          2 (7.1)         1 (9.1)         4 (6.3)
Criterion                        0 (0.0)          4 (14.3)        0 (0.0)         4 (6.3)
Concurrent                       1 (4.2)          2 (7.1)         0 (0.0)         3 (4.8)
Discriminant                     2 (8.3)          0 (0.0)         1 (9.1)         3 (4.8)
Criterion-related                1 (4.2)          1 (3.6)         1 (9.1)         3 (4.8)
Construct-related                1 (4.2)          0 (0.0)         1 (9.1)         2 (3.2)
Factorial                        1 (4.2)          0 (0.0)         0 (0.0)         1 (1.6)
Internal                         1 (4.2)          0 (0.0)         0 (0.0)         1 (1.6)
Cross-cultural                   0 (0.0)          1 (3.6)         0 (0.0)         1 (1.6)
Cross-                           1 (4.2)          0 (0.0)         0 (0.0)         1 (1.6)
External                         1 (4.2)          0 (0.0)         0 (0.0)         1 (1.6)
Population                       1 (4.2)          0 (0.0)         0 (0.0)         1 (1.6)
Consensual                       0 (0.0)          1 (3.6)         0 (0.0)         1 (1.6)
Extratest                        0 (0.0)          1 (3.6)         0 (0.0)         1 (1.6)
Incremental criterion-related    0 (0.0)          0 (0.0)         1 (9.1)         1 (1.6)
Operational                      0 (0.0)          0 (0.0)         1 (9.1)         1 (1.6)
Local                            0 (0.0)          0 (0.0)         1 (9.1)         1 (1.6)
(All remaining labels — Differential, Content, Domain, Single-group, Diagnostic, Concurrent criterion, Congruent, Discriminative, Edumetric, Empirical, Face, Interpretative, and Job component — had zero occurrences in 2005–2010.)

Note. The table displays frequency of occurrence as n (%). EdPM = Educational and Psychological Measurement; JPA = Journal of Personality Assessment; JAP = Journal of Applied Psychology.

The continued proliferation of new VMLs. Not only do VMLs continue to be used repeatedly in research reports, new VMLs are continuously being invented. An unstructured survey of the wider literature, which sought to identify as many new VMLs as possible within the field of EPM, identified a whole host.6 They appeared within new VML-based taxonomies and as free-standing additions to the literature. They included general validity, specific validity (Tenopyr, 1986); representational validity, elaborative validity (Foster & Cone, 1995); prospective validity, retrospective validity (Jolliffe et al., 2003); formative validity, summative validity (Allen, 2004); site-validity, system-validity (Freebody & Wyatt-Smith, 2004); design validity, interpretive validity (Briggs, 2004); diagnostic validity (Willcutt & Carlson, 2005); translation validity (Trochim, 2006); structural validity, elemental validity (Hill, Dean, & Gaffney, 2007); cognitive validity, context validity, scoring validity (Shaw & Weir, 2007); manifest validity, semantic validity (Larsen, Nevo, & Rich, 2008); operational validity (Lievens, Buyse, & Sackett, 2008); extratest validity (Hopwood, Baker, & Morey, 2008); decision validity (Brookhart, 2009); cross-age validity (Karelitz, Parrish, Yamada, & Wilson, 2010); retrospective validity (Evers, Sijtsma, Lucassen, & Meijer, 2010); and generic validity, psychometric validity, and relational validity (Guion, 2011).

6 The survey only counted VMLs that had been published in "respectable" measurement books or journals. For instance, it did not count terms like intentional validity, observation validity, and representation validity, which had been found on the Internet but could not be traced to a traditional publication.

The Validity of the Test

Increasingly, in recent years, writers have expressed sentiments ranging from embarrassment to exasperation that measurement specialists continue routinely to disregard the original validity standard. For example, in a presidential address to the NCME, Frisbie (2005) lamented that validity continued to be the most misunderstood or widely misused of all terms, consistently being used in ways that contradicted the consensual understanding. He quoted numerous examples from the literature of authors using phrases like the test will be valid or the validity of the test or test validity. Frisbie was not the first, nor the last, to have made this observation. Twenty years earlier, Lawshe (1985) had observed essentially the same thing.



In an analysis of reviews from the 16th Mental Measurements Yearbook, published in 2005, Cizek et al. (2008) judged that 30% of all reviews referred to validity as a property of a test (cf. a score, inference, or interpretation), which corresponded to 55% of reviews that could be classified definitively. Our own research into the prevalence of VMLs noted the use of the validity in titles of articles published between 1975–1980 and 2005–2010, with some indication of greater frequency of usage in the earlier period. We also observed the use of (what we termed) referent modifiers; not just test validity, but item validity, score validity, scale validity, assessment center validity, interviewer validity, questionnaire validity, instrument validity, and measurement validity. All of these uses would appear to be out of kilter with the claim that tests are not the kind of thing that can be valid or invalid.

To provide a little more insight into how valid is used within journal articles, we conducted a simple case study, based upon abstracts published in a leading journal of the field, Educational and Psychological Measurement. The online abstracts of articles published within two distinct periods were searched for the occurrence of the term valid. Each occurrence was coded in terms of what, exactly, was being referred to as valid.7 Results are presented in Table 3.

It is interesting to note that none of the 90 occurrences referred to a valid interpretation and only one referred to a valid inference. A substantial number of occurrences were coded as valid instrument (including instrument, test, subtest, test form, scale), and even more were coded as valid measure (including measure, measurement approach). The latter often read as though it were tantamount to valid test. The trend toward reference to valid measurement and valid scores (including scores, results, data) may, perhaps, hint at somewhat greater avoidance of reference to test validity during the later period. Of particular interest was the high prevalence of valid predictor (including predictor, predictions, criterion estimates) during both periods. May we refer to a valid predictor? Is it implicitly rejected by the Standards in the same way as a valid test? One would have thought so, although we have never noticed the claim stated explicitly. Equally, we would assume that talk of a valid item is dismissed, despite the concept of item validity having a pedigree in EPM that dates back long before the first edition was penned (e.g., Lindquist, 1936).

7 The research was originally intended to involve two 20-year periods: beginning of 1993 to end of 2012 and beginning of 1961 to end of 1980. In fact, the online abstracts for Educational and Psychological Measurement are only stored electronically, and therefore only searchable, back to 1974. As it happened, the period from 1974 to 1980 returned more instances of valid than the period from 1993 to 2012, so this sufficed for rough comparative purposes. During the later period, no abstract contained the term valid more than once. During the earlier period, it was not uncommon for valid to appear more than once in a single abstract. Where there was more than one occurrence, only the first was coded. The coding was very straightforward and almost always unambiguous.
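As an illustration of the coding step, consider the deliberately naive sketch below. The abstracts are invented, and the published coding was done by hand; taking the word that follows valid as its referent is only a rough stand-in for that judgment.

```python
import re
from collections import Counter

# Invented example abstracts; the study searched the journal's online
# abstracts for the term "valid" across two publication periods.
abstracts = [
    "Results suggest that the scale is a valid measure of trait anxiety.",
    "The composite proved to be a valid predictor of first-year grades.",
    "Scores supported valid inferences about examinee proficiency.",
]

referents = Counter()
for abstract in abstracts:
    # Per the authors' coding rule, only the first occurrence of "valid"
    # in each abstract is coded (see Footnote 7). The word-boundary
    # pattern avoids matching "validity".
    match = re.search(r"\bvalid\b\s+(\w+)", abstract, re.IGNORECASE)
    if match:
        referents[match.group(1).lower()] += 1

print(referents)  # Counter({'measure': 1, 'predictor': 1, 'inferences': 1})
```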

Table 3
Referents of the Term Valid Within Educational and Psychological Measurement

                 January 1974 to January 1980      January 1993 to November 2012
Referent         Frequency (n = 49)     %          Frequency (n = 41)     %
Indicator        4                      8.2        0                      0.0
Instrument       12                     24.5       5                      12.2
Measure          10                     20.4       9                      22.0
Measurement      0                      0.0        5                      12.2
Predictor        14                     28.6       6                      14.6
Scores           1                      2.0        9                      22.0
Other            8                      16.3       7                      17.1

Explaining the Disjunction Between Standards and Custom and Practice

There are at least three major categories of explanation for the disjunction between standards for talking about validity and how validity is actually talked about in the published literature:

• intentional misuse: understanding the consensus conception, and accepting it, but choosing to use nonconsensus language (i.e., choosing to disregard standards for talking about validity but not standards for thinking about validity);
• lack of awareness or misunderstanding: not understanding the consensus conception, and using nonconsensus language (i.e., not intentionally disregarding standards for talking about or thinking about validity); and
• genuine divergence: understanding the consensus conception, but rejecting it, and choosing to use nonconsensus language (i.e., choosing to disregard standards for talking about and thinking about validity).

These three categories provide a useful structure for reflecting upon the evidence presented above; so, for each standard in turn, we shall illustrate each category of explanation.

Thou Shalt Not Refer to the Validity of the Test

We begin by illustrating the range of reasons that we have encountered for referring to validity as though it were a property of tests.

Intentional misuse. It is not uncommon for writers to note, apologetically, that although they fully accept the consensus position on validity, they will lapse into loose talk because it is easier or more comfortable to do so: for example, "We sometimes speak of the 'validity of a test' for the sake of convenience, but it is more correct to speak of the validity of the interpretation and use to be made of the results" (Miller, Linn, & Gronlund, 2009, p. 72). Many authors agree that when measurement specialists refer to TVOTT they often do so "elliptically" (Kane, 2009, p. 40), or as "shorthand" (Guion, 2009, p. 467; Landy, 1986, p. 1186; Zumbo, 2009, p. 67), or merely as "a matter of convenience" (Reynolds, Livingston, & Willson, 2010, p. 124). In such cases, so the argument goes, there is no rejection of standards for thinking about validity, only of standards for talking about validity.

Lack of awareness or misunderstanding. A more worrying explanation of why measurement specialists still so frequently refer to TVOTT is that they are ignorant of validity standards, or fail to understand the principles underlying them. Although there have been no systematic investigations into this possibility, many suspect it to be true (e.g., Hubley & Zumbo, 1996). Commenting more narrowly upon the state of personnel psychology, Guion (2009) suggested, in exasperation, that even the traditional notion of validity was still not yet understood, let alone the modern one.


He wondered whether this was because members of his profession had simply not studied the relevant literature.

Genuine divergence. It is important to acknowledge that some people who refer to TVOTT will do so intentionally, consistent with beliefs that depart from the consensus position. In articles reminiscent of the Hans Christian Andersen fable The Emperor's New Clothes, Borsboom, Mellenbergh, and van Heerden (2004) and Borsboom, Cramer, Kievit, Scholten, and Franic (2009) argued forcefully against the view of validity as a property of interpretations, claiming that validity is necessarily a property of tests. Even more controversially, they claimed that this is the de facto consensus view among measurement specialists: rejected by construct validity theorists, but embraced by "the rest of the inhabitants of the scientific world" (Borsboom et al., 2009, pp. 163–164). Because the Standards was not cited in either article, it is unclear whether Borsboom and colleagues appreciated that they were not simply challenging an informal consensus amongst modern-day construct validity theorists, but the official position of the EPM supracommunity since the Standards was first penned. Interestingly, in their wake, other dissenters have made similar views known, including Lissitz and Samuelsen (2007).

Immediate reflections. Oddly enough, it is technically possible to refer to TVOTT without disregarding the original validity standard. Note how each edition specified that it was incorrect to use the "unqualified" phrase TVOTT. Presumably, then, if the phrase is qualified, it ought to be acceptable to speak of test validity after all. Note the following, from the Standards and Messick (1989), respectively:

    If the validity of the test can reasonably be expected to be different in subgroups which can be identified when the test is given, the manual should report the validity for each group separately or should report that no difference was found. (APA et al., 1954, p. 26)

    First, a test that is valid for a job or task in one setting might be invalid (or have a different validity) for the same job or task in a different setting, which constitutes situational specificity per se. Second, a test that is valid for one job might be invalid (or have a different validity) for another job albeit in the same setting, which is better described as job specificity. (Messick, 1989, p. 82)

Although even Messick often used phrases like test validity without explicit qualification, these two examples are useful in highlighting the possibility that people who refer to TVOTT may do so with a clear, albeit largely implicit, presumption of qualification. It certainly seems that when Borsboom refers to TVOTT, he fully accepts that the test might be valid for one particular group of students while invalid for another, or valid for one interpretation and use of results yet invalid for another (see Borsboom, 2012; Borsboom & Mellenbergh, 2007, pp. 104–105). More generously, still, if the term test is interpreted to mean measurement procedure, in its broadest sense (including instrument, administration procedure, scoring procedure, and intended interpretation), then this may dissolve the notion of divergence between the two camps entirely, rendering debate over the legitimacy of terms like test validity something of a red herring (Newton, 2012b). However, the debate is not quite so easily dissipated because there are actually two further standards for talking and thinking about validity lurking here:


3. Thou shalt use the term validity when evaluating decision making procedures (i.e., it is correct to speak of the validity of the use of test scores, as well as the validity of their interpretation).

4. Thou shalt use the term validity when evaluating impacts from measuring (i.e., it is correct to speak of the validity of the overall testing policy).

Scholars like Borsboom and Mellenbergh (2007) reject both of these standards, claiming that they reflect professional issues that ought to be described with different terms. Instead, they propose, the concept of validity, and therefore talk about validity, ought to be restricted to scientific issues, that is, to issues of measurement (see also Cizek, 2012; Scriven, 2002). However, it is worth noting that some scholars have drawn precisely the same distinction between professional and scientific interests yet have reached precisely the opposite conclusion, that is, that validity ought to be restricted to professional talk of decision making and not be used for scientific talk of measurement (e.g., Gaylord & Stunkel, 1954). Successive editions of the Standards have always discussed decision making as a part of validity, particularly as the focus for criterion-related validation. The inclusion of impacts, however, is a more recent and more controversial addition. Even the fifth edition of the Standards is ambiguous on this matter. Newton (2012a) has argued that it upholds the third standard, but not necessarily the fourth.

Thou Shalt (Not) Use VMLs

We have not uncovered any explicit discussion of reasons for the proliferation of new VMLs beyond the few defined in the Standards (either between 1954 and 1984 or subsequently), so the following sections focus particularly upon explanations that have been offered for the continued use of traditional VMLs following their official rejection in 1985.

Intentional misuse. The use of VMLs has not been debated widely in the literature, although a number of arguments for and against have been proposed, particularly in relation to content validity. Yalow and Popham (1983), for instance, warned that relabeling content validity might substantially reduce attention to content coverage within validation. Fifteen years later, Sireci (1998) suggested that this had indeed occurred. Shepard (1993) took a contrary view, however, believing that the replacement of the "x validity" formulation with "x-related evidence of validity" in the fourth edition of the Standards had failed to flag the important conceptual change sufficiently, noting the persistence of inappropriate conceptions even within measurement journals from the 1990s (see also Moss, 1995). Despite having feet firmly rooted in modern validity theory, Sireci (1998) staunchly defended the continued use of the content validity label, arguing that if validity is understood in the everyday sense of the logical grounding of a claim, then the VML formulation is still technically correct; that new terms for describing the family of issues and procedures fundamental to content-related evaluation—content relevance, content representation, and domain definition—fail to cohere as a group; and that the idea of content validity is far easier for nonpsychometric audiences to comprehend.




As it happens, each of Sireci's three reasons for continuing to use the term content validity might be challenged. First, reverting to an everyday conception is inconsistent with the attempt to specify a precise technical meaning for validity, specific to EPM, to which the EPM communities have aspired for the best part of a century. It is also fair to say that there are many everyday senses of validity, so deference to a particular one might be considered arbitrary. So the idea that common sense furnishes a satisfactory technical meaning, consistent with the use of the term content validity, seems problematic. Second, the family of issues and procedures fundamental to content-related evaluation could, quite straightforwardly, be grouped through the use of content rather than validity. So this argument, too, seems at least debatable. Third, the suggestion that content validity is easier for lay audiences to understand is presumably meant to imply that the traditional caricature of validity is easier to understand than the modern view. This may be true, but whether it is appropriate to continue promulgating a spurious view of validity is questionable, even when communicating with validity novices. Was it not the traditional oversimplification of validity that got us into trouble in the first place (see Dunnette & Borman, 1979)? The more general claim that content validity is easy to understand is also questionable, in light of the very many different versions of content validity that are still in circulation, to which the Internet bears testament. The fact that Sireci (2007) failed to recognize the version defended by Lissitz and Samuelsen (2007) suggests that the idea of content validity is not quite as unproblematic as might be assumed. In short, there are certainly questions to be raised in response to a purely pragmatic defense of the continued use of VMLs.

Sireci is certainly not alone in claiming educational benefits from continuing to use VMLs. In their textbook on psychological testing, McIntire and Miller (2007) began their discussion of validity with reference to the modern view and explained the five sources of evidence from the fifth edition of the Standards. However, three subsequent chapters focused explicitly upon the traditional characterization from the second edition: content validity, in which they included face validity; criterion-related validity, including predictive validity and concurrent validity; and construct validity, including discriminant validity and convergent validity. They justified their traditional presentation on the basis that "a student would not be able to interpret more than 100 years of testing literature, including case law, without a strong understanding of the three traditional types of validity" (p. 224).

Lack of awareness or misunderstanding. Cizek et al. (2008) conducted one of the very small number of empirical studies into the appropriation of standards for talking about validity, based upon an analysis of reviews prepared for the 16th Mental Measurements Yearbook. They judged that only seven of the 283 reviews used language consistent with the modern view and that the most common convention was to use language consistent with the traditional view (i.e., making reference to types of validity). They noted an explanation that had been offered for a similar phenomenon, some years earlier, by Shepard (1993): that practicing psychometricians do not actually understand the theory that they claim to be applying. This explanation is consistent with an observation from Camara and Lane (2006) that, in many instances, practitioners may be unfamiliar with their professional standards and have little exposure to new developments in assessment during their graduate training. Cizek et al. (2008) also offered a slightly different kind of explanation: that practitioners may fail to read any deep significance into the language that they use, such that the distinction between, say, content validity and content-related evidence of validity is not especially salient for them. Their knowledge of modern validity theory might not be entirely lacking, but they might still fail to appreciate why (or even that) using the language of traditional validity theory is problematic. In other words, they may have begun to appropriate the new standard for thinking about validity without having appropriated the new standard for talking about validity, that is, the rejection of VMLs. Cizek et al. recommended a more aggressive promulgation of such standards in future years to overcome this challenge.

Genuine divergence. From inspection of the literature alone, it would be hard to tell whether those who simply used VMLs did so with a clear understanding of the Standards and, therefore, with an appreciation of how their use related to the consensus position. On the other hand, we might hope that those who ventured to invent new VMLs, who aspired to be validity scholars, would do so with at least some appreciation of the manner in which they were diverging from established standards for talking about validity. The following sections reflect upon the use and invention of VMLs during two phases: pre- and post-1985.

1954 to 1984. For 3 decades, the consensus of the EPM supracommunity remained essentially unchanged: There were basically just three kinds of validity—content, construct, and criterion—and evaluators should make explicit which of the three they were talking about whenever validity was to be claimed. Naturally, there were scholars who explicitly disagreed with the consensus position and who proposed new VMLs to correct it (e.g., Cattell, 1964). More interesting, though, were the scholars whose new VMLs were proposed in order to elaborate upon, rather than to challenge, the Standards (e.g., Campbell, 1960; Campbell & Fiske, 1959; Lawshe, 1952; Sechrest, 1963). These elaborations represented only minor divergence from the Standards, each implying that the four validities of the first edition failed to capture all the important distinctions. The fact that the second edition, published in 1966, included no discussion of these proposed elaborations indicates that they made no substantive impact on the consensus position. In fact, not only did the second edition deem the taxonomy of the first edition to have captured all the important distinctions—that is, all the important "interpretative inferences" (APA et al., 1974, p. 26)—it actually reduced the number of validities from four to three. Despite this tacit rebuttal, new VMLs continued to be invented. Some of these diverged significantly from the consensus position (e.g., Carver, 1974; Lord & Novick, 1968); others could be seen more as elaboration (e.g., Popham, 1978). Incidentally, although the wholly new types of validity invented during this period would appear, at least by implication, to represent genuine divergence from the Standards (e.g., Bemis, 1968; Boehm, 1972; Dick & Hagerty, 1971), they were not necessarily presented as such.
In summary, this early proliferation of VMLs seems to represent a groundswell of dissatisfaction with the standards for thinking and talking about validity presented within the Standards, although it is fair to say that this was not always presented as explicit divergence or dissatisfaction.
Even some of the most influential scholars of the day felt the need either to elaborate upon the consensus position or to challenge it, through the invention of new VMLs. Ironically, this dissatisfaction—in stark contrast to the position that was to be championed by Messick—seemed to argue for increased fragmentation, not unification.

1985 to present-day. The continued use of VMLs, following the official unification of validity theory in the fourth edition of the Standards, represents a disregard of standards for talking about validity. However, in the absence of further research, it would be impossible to determine confidently either the extent to which this represented intentional disregard or the extent to which it also represented a rejection of standards for thinking about validity. Nonetheless, the continued use of terms like predictive validity, content validity, incremental validity, and factorial validity in the titles of articles in prominent EPM journals does seem to hint at a certain amount of genuine divergence and dissatisfaction. All of these articles presumably passed through a review process, with their titles, at least, reviewed by the journal editor. If the continued use of traditional VMLs were more careless than intentional, or more casual than formal, we might expect this to have been picked up during the review process. The case is stronger still for the invention of new VMLs: these are proposed by people who claim to be validity scholars, so we would certainly hope that any divergence from the consensus position was intentional. If they were explicitly disregarding the Standards, however, we might also expect them to comment upon this, and that was not always the case. The continued proliferation of new VMLs during this phase remains ironic, but even more so now, because it represents the extension of a trend toward increased fragmentation of the concept of validity against the backdrop of its official unification. It is odd that the peculiarity of this phenomenon seems not to have been widely discussed.

Immediate reflections. The most ironic new VML of recent years is consequential validity, a term that is now common in the literature. It has been the focus of much debate concerning the fourth standard for talking about validity mentioned earlier. The irony derives from the fact that the term continues to be attributed to Messick (e.g., Lissitz & Samuelsen, 2007, p. 445) despite the fact that it was Messick who wrote the definitive critique of VMLs (Messick, 1980). Messick was interested in consequential evidence of validity but explicitly refrained from using the term consequential validity, for obvious reasons. This slip raises an interesting and important question: How is it that even scholars of validity occasionally fail to see the irony in attributing the term consequential validity to Messick (1989)? It is almost as though there were something inherently incorrigible about loose talk on validity, something that repeatedly defies any attempt to control it. It seems true for talk of test validity, and it seems true for the use of VMLs.

To be fair, though, the official rejection of VMLs is not without consequence. It leaves the supracommunity without a multitude of terms for identifying distinctive ideas within what has now become an extremely broad concept. Messick (1980) provided a list of terms that might be substituted for some of the most popular VMLs, but these have largely failed to transfer into mainstream discourse. No one else, to our knowledge, has either extended his list or offered an alternative one.


The official rejection of VMLs also leaves the supracommunity in a bizarre position whereby official standards dismiss the use of VMLs—so there are no longer any official definitions of content validity, predictive validity, factorial validity, and so on—yet these terms remain a feature of custom and practice (i.e., of everyday discourse between measurement specialists). Where, then, ought validity novices to turn in order to find out what their colleagues and peers are talking about? The more new VMLs appear on the scene, the greater the potential for confusion within the supracommunity—let alone beyond it—especially in light of their removal from the Standards.

To date, we have identified 122 discrete VMLs, each invented to capture some aspect or another of validity for measurement (see Table 4). We have identified another 35 that seem to be no more than synonyms for those presented in Table 4. Within Table 4 are VMLs that express discriminable, but actually quite similar, concepts: for example, logical, rational, content, curricular, face, and context; empirical, practical, and criterion; local and situational; intrinsic and construct. Then there are VMLs that appear only once in Table 4 but that have been endowed with completely different meanings by different measurement scholars, such as decision validity (e.g., Hambleton, 1980, vs. Brookhart, 2009), differential validity (e.g., Richardson, 1936, vs. Linn, 1978), internal/external validity (e.g., Guttman, 1950, vs. Loevinger, 1957), intrinsic validity (e.g., Gulliksen, 1950, vs. Guilford, 1954, vs. Cureton, 1965), functional validity (e.g., Popham, 1978, vs. Cone, 1995), practical validity (e.g., Guilford, 1946, vs. Campbell, 1960), prospective validity (e.g., Jolliffe et al., 2003, vs. Hoffman & Davis, 1995), psychometric validity (e.g., Carver, 1974, vs. Guion, 2011), retrospective validity (e.g., Jolliffe et al., 2003, vs. Evers et al., 2010), semantic validity (e.g., Burns, 1995, vs. Larsen et al., 2008, vs. Hanlon et al., 2008), and structural validity (e.g., Hill et al., 2007, vs. Loevinger, 1957). There are also VMLs for measurement that have a different meaning as VMLs for research, such as construct validity (e.g., Cronbach & Meehl, 1955, vs. Cook & Campbell, 1979, vs. Lather, 1986), internal/external validity (e.g., Guttman, 1950, vs. Campbell, 1957), descriptive validity (e.g., Popham, 1978, vs. Maxwell, 1992), relational validity (e.g., Guion, 2011, vs. Julnes, 2011), face validity (e.g., Guilford, 1946, vs. Lather, 1986), and interpretive validity (e.g., Briggs, 2004, vs. Maxwell, 1992). Finally, the meanings of some of the oldest and most popular VMLs have multiplied steadily over time, to the point where it is almost impossible to say what their real meaning might once have been, such as construct validity (e.g., Bechtoldt, 1959; Borsboom et al., 2009; Kane, 2008; Loevinger, 1957; Maraun, Slaney, & Gabriel, 2009; Messick, 1992; Smith, 2005), content validity (e.g., Ebel, 1983; Fitzpatrick, 1983; Guion, 1977a, 1977b; Lennon, 1956; Messick, 1975; Murphy, 2009; Sireci, 1998; Yalow & Popham, 1983), and face validity (e.g., Guilford, 1946; Mosier, 1947; Nevo, 1985; Rulon, 1946). In fact, having reviewed the variety of definitions provided for face validity, content validity, and construct validity, respectively, Mosier (1947), Fitzpatrick (1983), and Guion (2011) each suggested that their respective terms be abandoned. Guion even observed that the term validity might have outlived its usefulness. Although face, content, and construct have probably had more meanings associated with them than any other VML, a similar story of ambiguity could also be told for many more, including factorial validity, differential validity, incremental validity, instructional validity, curricular validity, and so on.




Table 4
One Hundred and Twenty-Two Kinds of Validity for Measurement

Administrative, Artifactual, Behavior domain, Cash, Cluster domain, Cognitive, Common sense, Concept, Conceptual, Concrete, Concurrent, Concurrent true, Congruent, Consensual, Consequential, Construct, Constructor, Content, Context, Contextual, Convergent, Correlational, Criterion, Cross-age, Cross-cultural, Cross-sectional, Cultural, Curricular, Decision, Definitional, Derived, Descriptive, Design, Diagnostic, Differential, Direct, Discriminant, Discriminative, Domain, Domain-selection, Edumetric, Elaborative, Elemental, Empirical, Empirical-judgmental, Etiological, External test, External, Extratest, Face, Factorial, Fiat, Forecast true, Formative, Functional, General, Generalized, Generic, Higher-order, Incremental, Indirect, Inferential, Instructional, Internal test, Internal, Interpretative, Interpretive, Intrinsic, Intrinsic content, Intrinsic correlational, Intrinsic rational, Item, Job component, Judgmental, Linguistic, Local, Logical, Longitudinal, Lower-order, Manifest, Natural, Nomological, Occupational, Operational, Performance, Practical, Predictive, Predictor, Procedural, Prospective, Psychological and logical, Psychometric, Rational, Raw, Relational, Relevant, Representational, Response, Retrospective, Sampling, Scientific, Scoring, Self-defining, Semantic, Single-group, Site, Situational, Specific, Structural, Substantive, Summative, Symptom, Synthetic, System, Systemic, Theoretical, Trait, Translation, Treatment, True, User, Washback

Note. A fully referenced list of all of the kinds of validity that are mentioned in this table is available from the first author.

It is worth mentioning in passing that many of the new VMLs of recent years have been quite insubstantial. For example, operational validity appeared only in the title of the article by Lievens et al. (2005) and seemed merely to imply validity within an operational setting. Similarly, occupational validity appeared only in the title of the article by Bemis (1968), referring to little more than validation in an occupational setting. Extratest validity appeared only in the title and abstract of Hopwood et al. (2008) and was not actually defined in the article. Likewise, other than in the title, elemental validity appeared only once in the article by Hill et al. (2007), and structural validity appeared only in the title. In short, the invention of a new VML makes for a snappy title but often conveys more style than substance.

Finally, it is tempting to speculate that there may be just a few basic kinds of validity, or categories of validity evidence, into which the vast majority of the VMLs that have been proposed over the years can straightforwardly be collapsed: perhaps the five categories of evidence from the 1999 Standards, or even the four kinds of validity from the 1954 Standards. There is certainly some truth in this speculation. For instance, the majority of VMLs that we identified were introduced as part of an explicit scheme for classifying aspects of validity. Almost all of these schemes highlighted contrasts very similar to those drawn in the original 1954 Standards; sometimes excluding traditional kinds/categories (e.g., excluding predictive), sometimes including new kinds/categories (e.g., including consequential), sometimes subdividing traditional kinds/categories (e.g., dividing construct into trait and nomological). The new VMLs were often introduced to foreground subtle, albeit important, differences in emphasis (e.g., cognitive compared with content), rather than radically different ways of thinking about validity. When we attempted to classify our comprehensive list of VMLs in terms of "broad similarity" to the Trinitarian scheme, we found that the large majority overlapped significantly with the traditional three categories, with some spanning two or all three of them. A substantial minority, around 20%, did not fit comfortably into any (e.g., cash, common sense, cross-age), although it is fair to say that few of these feature heavily in the literature. Those VMLs that did not fit comfortably, but that do feature significantly in the literature, included incremental, procedural, and three that correspond to the consequences category from the 1999 Standards: systemic, washback, and consequential. These findings provided some informal support for the utility of the five-way classification of sources of evidence from the 1999 Standards, based, as it was, upon the three-way classification of sources of evidence from the 1985 edition, expanded to encapsulate response as well as content sampling and to include consequences; this, in turn, derived from the three kinds of validity within the 1966 edition.


The Desirability of Standards for Thinking and Talking About Validity

We end by reflecting upon the desirability and viability of standards for talking and thinking about validity.


Standards for Thinking About Validity

A central component of scientific practice is persuasion, that is, the attempt by one or more scientists to bring others around to their view. In this respect, consensus is the holy grail of science, and standards for thinking about validity therefore represent an appropriate ambition for EPM, even from a purely scientific perspective. From the perspective of the scientist, of course, these would be descriptive standards, not prescriptive ones. It would be the antithesis of science to require all scientists to work within a common paradigm.

What, then, of prescriptive standards, like those found in the Standards? These are more pragmatic, sometimes legalistic, devices. They specify principles of professional practice that help to establish the credibility of practitioners within a community and the credibility of the profession within society. Educationalists and psychologists around the world recognize the significance of prescriptive standards, through which to establish and defend the credibility of measurement practice.

The prescriptive professional standards of EPM in North America have built upon descriptive scientific standards. Thus, despite their pragmatic focus, successive editions of the Standards have sought to establish their credentials rationally, by being grounded in the scientific paradigms of their day, from the traditional construct validity of Cronbach and Meehl (1955) to the modern construct validity of Messick (1989). The introductory text to each of the successive validity chapters expressed the consensus judgment of the EPM professions concerning the descriptive standard of the day for thinking about validity. The Standards therefore effectively prescribe a particular way of thinking about validity as a rational foundation for measurement practice. The idea of consensus seems doubly important in establishing credibility: consensus, on the one hand, between scientific and professional conceptions of validity, and consensus, on the other, among EPM professionals in how they conceptualize validity.

One final point: Standards for thinking about validity would seem to be important if consensus is to be reached on how to define the many other technical characteristics through which EPM is to be evaluated (e.g., reliability, bias, fairness).

Standards for Talking About Validity

The conclusion that standards for thinking about validity are important does not entail an obligation upon measurement specialists to uphold corresponding standards for talking about validity. So why were such standards (prescriptive, no less) ever thought to be necessary? This only really makes sense against a backdrop of negative impacts arising from inappropriate talk. The context within which the first edition of the Standards was published exhibited these features. Claims to validity were misinterpreted as though, for instance, a single correlation coefficient could sanction the use of a test for any purpose under any condition. Standards for talking about validity were intended to help to rectify this by helping users to understand the conditional nature of any claim to validity.

Now, assuming that there used to be a reasonable case for promoting standards for talking about validity, does the same remain true today? The fact that the Standards rejected the use of VMLs in the mid-1980s suggests that standards for talking about validity continued to be important. It was recognized that talking as though there were different types of validity had led measurement specialists to think about validity as a fragmented concept, with consequent negative impacts upon validation (Dunnette & Borman, 1979; Messick, 1980). Frisbie (2005) insisted that similar negative impacts from disregarding standards for talking about validity continued to occur even into the 21st century, from poor testing practice to weak validation, to widespread miscommunication within and beyond the professions. In short, there are good pragmatic reasons to think that standards for talking about validity are desirable, to help clarify standards for thinking about validity and validation.

The Viability of Standards for Thinking and Talking About Validity

We began by asking why—if there is supposed to be a consensus over standards for talking and thinking about validity—the standards continue to be disregarded in practice. We presented new evidence to illustrate this phenomenon and discussed possible reasons, which included intentional misuse, lack of awareness or misunderstanding, and genuine divergence from the consensus. Our historical analysis demonstrated an enduring lack of consensus concerning standards for talking about validity. It is hard to reach any definitive conclusion concerning the extent to which the disregarding of standards for talking about validity represents deeper dissatisfaction with standards for thinking about validity. Although there does appear to be an element of genuine frustration with the terminology for marking important dimensions of quality in EPM, our general impression from reading the literature on validity theory is that there is little appetite for returning to a fragmented conception. We do, however, note substantial disagreement over how, and to what, the term validity ought to be applied, representing a fundamental lack of consensus over standards for thinking about validity. We end by highlighting four outstanding challenges and a strategy that might go some way toward ameliorating them.

The first two challenges relate to the use of VMLs, and they are in tension. On the one hand, it is clear that VMLs not only continue to be used but continue to be invented. This inextinguishable desire to fragment would seem to be the antithesis of unification. Yet whether dimensions of quality in EPM can be more effectively marked through an expansion of the official lexicon would seem to remain an open question. Clearly, if the move toward unification has meant a blurring of important distinctions, then it has made it harder to teach validity, harder to learn validity, and thus increased the risk of lack of awareness and misunderstanding. On the other hand, it is clear that the rampant proliferation of VMLs has not served EPM well. The only VML-based taxonomies that have ever gained widespread respect are to be found in the first three editions of the Standards.
And the individual VMLs that have been proposed by so many, over so many years, simply do not cohere as a substantive contribution to validity theory, not that they have ever been presented as such. In fact, they are downright confusing: There are so many of them; some use the same term for different meanings; some use different terms for the same meaning; some of them seem extremely trivial; and so on. In short, there seems to be a need for standards for talking about validity that are capable of marking all the important distinctions without being distracted by unimportant ones. We are certainly at liberty to ask whether the categories used in the fifth edition of the Standards are optimal in this respect, although, as we discussed earlier, they do seem to resonate, at least, with the vast majority of VMLs that have been proposed over the years.

The second two challenges relate to the TVOTT debate. As we explained earlier, there actually seems to be little disagreement over the principle that any claim to validity is conditional, that is, little debate over this standard for thinking about validity, only over the corresponding standard for talking about it. Those who insist on referring to TVOTT tend simply to presume conditionality and see no need to mark it discursively. The third challenge, therefore, is how we might resolve this standoff.

There is, however, a more substantial debate lurking below the surface, that is, genuine disagreement over the level (or levels) at which a claim to validity might be staked. Four such levels illustrate the spectrum of opinion:

• the elements of the measurement procedure (e.g., "the item is valid"),
• the measurement procedure (e.g., "the test is valid"),
• the decision procedure (e.g., "the use of the test is valid"), and
• the testing policy (e.g., "the system is valid").

At each of these levels, the purpose of declaring its subject valid is, in effect, to declare that its subject is fit to be used as one component of a higher level process: The item is fit to be used in the measurement procedure; the measurement procedure is fit to be used in the decision procedure; the decision procedure is fit to be used in the testing policy; the testing policy is fit to be used in the construction of a good society. With each new level, the claim to validity concerns different kinds of conclusion, derived from different kinds of evidence and analysis.

Since the mid-1950s, successive editions of the Standards have always adopted a fairly broad conception of validity, tailored ultimately to the intended use of test scores (i.e., to the decision procedure). In recent years, many have wanted to extend validity to the level of testing policy. There are two related problems here. First, some now believe that the concept of validity has become too global to be useful (e.g., Brennan, 1998). It is not just that validity has become very hard to grasp and to communicate. It is also that, as it has moved beyond traditional territories and boundaries, it has become increasingly tricky to operationalize. If validation is to include an evaluation of measurement aims, decision making aims, and broader testing policy aims, then who ought to coordinate evaluation on this scale, and who ought ultimately to be responsible for it? Concerns such as these, and others, have split the field into those who insist that validity ought to be considered a narrow, scientific concept (e.g., Borsboom et al., 2004; Cizek, 2012; Lissitz & Samuelsen, 2007; Maguire, Hattie, & Haig, 1994; Mehrens, 1997; Popham, 1997; Scriven, 2002) and those who insist that it ought to be considered a broad, scientific and ethical one (e.g., Cronbach, 1988; Kane, 2013; Linn, 1997; Messick, 1980; Shepard, 1997).

Second, the failure to restrict talk of validity to a particular level undermines any attempt to specify a precise technical meaning for validity within EPM. For example, many measurement specialists are quite happy to refer to the validity of decisions and interpretations and test scores and tests and questions, and so on (e.g., Pollitt, 2012). Yet the more liberally we use the term, the less precise its meaning becomes.

A reviewer of the first draft of this article commented that the attempt to provide a precise technical definition of validity was in vain because it is a family resemblance concept; that is, there are clusters of features associated with all uses of the term validity, but no one feature that is common to all. This very helpfully gets to the crux of the matter: whether validity is, or could be fashioned into, a family resemblance concept. In terms of the current status of validity, there are three alternatives: Its meaning is captured by a precise technical definition; in the absence of a precise technical definition, its meaning is captured by unwritten rules that govern its application (i.e., it functions as a family resemblance concept); or it has no clear meaning, and it tends to be used indiscriminately, arbitrarily, or in all sorts of different ways.

As we have seen, the North American EPM supracommunity has been trying to provide a precise technical definition of validity for the best part of a century. The first official definition—framed exclusively in terms of measurement quality—was fairly precise: the degree to which a test measures what it purports to measure. As this definition was expanded to include prediction, it became less precise. When the concept was officially fragmented into a small number of kinds, it came to elude definition. The proliferation of unofficial VMLs epitomized and exacerbated this tolerance of imprecision. Subsequently, the unification of validity encouraged us to embrace precision once again, reestablishing measurement quality (i.e., score meaning) as the essence of all validity (see Newton, 2012a). Nowadays, though, it is clear that the term is used in all sorts of different ways, many of which appear to conflict with the official consensus position. Indeed, in practice, there does not seem to be any consensus over the proper application of the term: Some say it applies simply to tests; others say it applies to interpretations, or even to systems; while others say that it applies to items, to testing policy, and to anything in between. Even the official consensus position itself is somewhat vague and confused (Newton, 2012a).

In summary, unlike the field of formal logic, where it has been possible to agree upon a precise technical definition for validity, it has not been possible to reach agreement in the field of EPM. More importantly, though, nor has it been possible to reach agreement upon its proper application in the absence of a precise technical definition (i.e., validity fails even to count as a family resemblance concept). The failure to reach agreement over a precise technical definition, despite a century of negotiation, suggests that it may not be a viable option. Yet might it still be possible to negotiate meaning for validity as a family resemblance concept? This is the fourth and most fundamental challenge.

As a final aside, we briefly return to the challenge of conditionality. Recall that reference to TVOTT was dismissed because any claim to validity is conditional. To many, it seemed that referring to validity as a property of interpretations, not tests, provided a straightforward solution to this problem. Yet this would only be true if interpretation were somehow immune to conditionality, or if conditionality were somehow built into interpretation.
The former is clearly not true. Any interpretation (of test scores) will be conditional (e.g., upon whether the test was administered properly, upon to whom it was administered, and so on). Likewise, the latter is simply not feasible. It would be impossible to identify each and every possible condition upon which the validity of the intended interpretation rested, either from a practical perspective or from a logical one.

On reflection, it seems that the threat of misunderstanding associated with TVOTT arises from the decision to declare anything valid or invalid, be that a test, an interpretation, or a testing policy. First, the grammar of the term invites an absolute interpretation, because it encourages us to think in terms of black and white, valid or invalid. Second, as soon as the term is applied to anything—for example, an element of a measurement procedure, a measurement procedure, a decision procedure, or a testing policy—the declaration of validity functions like a stamp of approval, a green light to proceed, or a license to practice. It declares, in a fairly absolute manner, fitness-for-purpose. As validity is declared, conditionality slips out of view.

Assuming that it may not be possible to agree upon a precise technical definition for validity, what are the prospects for fashioning it into a family resemblance concept by agreeing upon parameters for its application? The trend nowadays, even among many validity theorists, seems to be to apply the term fairly liberally, to interpretations and uses and even to impacts from testing that bear no relation to measurement or decision making. If, as we have just discussed, the real problem with talk of TVOTT is declaring anything valid (i.e., it is not solved by restricting the term validity to interpretations and uses), then reaching agreement upon a very broad use of the term—applicable to items, to testing policy, and to anything in between—appears to be far more viable.

There could be much to recommend this strategy. Kane (2012) argued that embracing a broad conception of validity increases the likelihood that important evaluation concerns are not overlooked (see also Bennett, 2012). Pollitt (2012) argued that embracing a broad conception of validity provides us with the conceptual means to hold everyone involved in test development to account. Agreeing upon such a permissive standard would certainly suggest that we had fashioned validity into a family resemblance concept. Indeed, it seems likely that we would thereby have created a concept very much like quality: quality of the item, quality of the test, quality of the testing policy. If so, then it would behoove us to consider whether the concept of validity actually captured anything beyond the concept of quality. Indeed, if the use of validity became indistinguishable from the use of quality—and it is hard to see what might distinguish the two concepts construed so liberally—then why would we retain the concept of validity at all?

The concept of quality is transparently liberal and has the advantage of having a general, nontechnical, commonsense meaning. It therefore invites interlocutors to clarify what they might mean by quality in the particular context of application. The concept of validity, by way of contrast, is quite opaque, having become obscured by a century of attempts to imbue it with precise technical meaning. Moreover, a legacy of having thus strived for precision is the risk that interlocutors will presume that the matter has been settled, potentially discouraging them from clarifying what they might mean by validity in the particular context of application.


Ultimately, if the way in which we chose to use the term validity rendered it tantamount to quality, then we would be well advised simply to talk of quality. Usefully, the grammar of quality would help to discourage us from making unnecessary declarations of fitness-for-purpose, simply because it has no direct analogue for valid. How often do we ever really need to declare anything within the field of EPM either valid or invalid? In those rare instances when declaration is deemed to be essential, the use of alternative terms such as legitimate or defensible would carry less risk of conveying inappropriate surplus meaning. Declaring a procedure valid transforms validity into an all-or-nothing concept (Newton, 2012a), which many experts consider to be an inappropriate and harmful image (e.g., Markus, 2012; Pollitt, 2012). The less frequently we make such declarations, the less this image is promulgated. The grammar of quality helpfully discourages all-or-nothing thinking.

Referring to quality, instead of validity, might also help to extinguish a long-standing confusion between validity and reliability. Debate continues over how best to theorize the relationship between these two: as though they represent largely distinct characteristics (e.g., Cattell, 1964); as though they represent regions on a single continuum (e.g., Campbell & Fiske, 1959; Marcoulides, 2004; Thurstone, 1931); or as though one, reliability, is simply a dimension within the other (e.g., Cureton, 1951; Kane, 2004; Messick, 1998). Quality, on the other hand, naturally establishes itself as a superordinate category within which reliability might comfortably reside as a dimension, thereby helping to achieve the synthesis recommended by Cureton, Messick, and Kane.

The most important motivation for embracing the concept of quality, and abandoning the concept of validity, is based upon the empirical evidence amassed in preceding sections. Over a period that spans nearly a century, it has proved impossible to secure consensus over the meaning of validity, not even as a family resemblance concept. This was evident in the long-standing debate over reference to TVOTT, as well as in the rampant proliferation of VMLs even after they had officially been rejected. It is currently epitomized in the standoff between those who insist upon a narrow, scientific conception of validity and those who insist upon a broad, scientific and ethical one. We need now to take radical action to dissipate this tension. If we are to talk meaningfully and productively about the characteristics of quality in EPM, then we need to bypass the concept of validity. So why not just cut out the middleman and talk directly about quality? This is to recommend quality as the principal family resemblance concept for evaluation within EPM, applicable equally across the three principal foci of measurement, decision making, and testing policy.

What exactly might we mean by quality in different contexts? What are the important distinctions that we need to capture, or recapture, when theorizing evaluation within EPM? Some might be tempted to consider breathing new life into the traditional lexicon, introducing quality modifier labels like content quality, predictive quality, and factorial quality. As Sireci (1998, 2007) reminded us, the rejection of VMLs has made it harder to discuss some of the important characteristics of test quality, and this could go some way to rectifying the situation. However, the very act of breathing new life into the old labels would risk reifying those concepts, in much the same way as the traditional VML formulation did. Furthermore, the adoption of certain useful quality modifier labels might open the floodgates to many far less useful ones: to summative quality, occupational quality, site quality, extratest quality, and a whole host of other dubious distinctions that have probably done more to mystify the landscape of validity theory over the decades than to clarify it. We follow the spirit of modern validity theory in preferring to think of quality, within EPM, as more holistic than fragmented, guided by three principal evaluation foci: quality of measurement, quality of decision making, and quality of testing policy.


References

Allen, M. J. (2004). Assessing academic programs in higher education. Bolton, MA: Anker.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Author.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
American Psychological Association. (1952). Technical recommendations for psychological tests and diagnostic techniques: Preliminary proposal. American Psychologist, 7, 461–475. doi:10.1037/h0056631
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1966). Standards for educational and psychological tests and manuals. Washington, DC: Author.
American Psychological Association, American Educational Research Association, & National Council on Measurement in Education. (1974). Standards for educational and psychological tests. Washington, DC: Author.
American Psychological Association, American Educational Research Association, & National Council on Measurements Used in Education. (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51(2, pt. 2), 1–38. doi:10.1037/h0053479
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 19–32). Hillsdale, NJ: Erlbaum.
Austin, J. (1995). The province of jurisprudence determined (W. E. Rumble, Ed.). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511521546 (Original work published 1832)
Bechtoldt, H. P. (1959). Construct validity: A critique. American Psychologist, 14, 619–629. doi:10.1037/h0040359
Bemis, S. E. (1968). Occupational validity of the General Aptitude Test Battery. Journal of Applied Psychology, 52, 240–244. doi:10.1037/h0025733
Bennett, R. E. (2012). Consequences that cannot be avoided: A response to Paul Newton. Measurement: Interdisciplinary Research and Perspectives, 10, 30–32. doi:10.1080/15366367.2012.686865
Boehm, V. R. (1972). Negro–White differences in validity of employment and training selection procedures. Journal of Applied Psychology, 56, 33–39. doi:10.1037/h0032130
Borsboom, D. (2012). Whose consensus is it anyway? Scientific versus legalistic conceptions of validity. Measurement: Interdisciplinary Research and Perspectives, 10, 38–41. doi:10.1080/15366367.2012.681971
Borsboom, D., Cramer, A. O. J., Keivit, R. A., Scholten, A. Z., & Franic, S. (2009). The end of construct validity. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 135–170). Charlotte, NC: Information Age.
Borsboom, D., & Mellenbergh, G. J. (2007). Test validity in cognitive assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education: Theory and applications (pp. 85–115). New York, NY: Cambridge University Press. doi:10.1017/CBO9780511611186.004
Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. doi:10.1037/0033-295X.111.4.1061
Bracht, G. H., & Glass, G. V. (1968). The external validity of experiments. American Educational Research Journal, 5, 437–474. doi:10.3102/00028312005004437
Brennan, R. L. (1998). Misconceptions at the intersection of measurement theory and practice. Educational Measurement: Issues and Practice, 17, 5–9. doi:10.1111/j.1745-3992.1998.tb00615.x
Briggs, D. C. (2004). Comment: Making an argument for design validity before interpretive validity. Measurement: Interdisciplinary Research and Perspectives, 2, 171–174. doi:10.1207/s15366359mea0203_2
Brookhart, S. M. (2009). The many meanings of "multiple measures." Educational Leadership, 67, 6–12.
Buckingham, B. R., McCall, W. A., Otis, A. S., Rugg, H. O., Trabue, M. R., & Courtis, S. A. (1921). Report of the Standardization Committee. Journal of Educational Research, 4, 78–80.
Burns, W. C. (1995). Content validity, face validity, and quantitative face validity. Retrieved from http://www.burns.com/wcbcontval.htm
Camara, W. J., & Lane, S. (2006). A historical perspective and current views on the Standards for Educational and Psychological Testing. Educational Measurement: Issues and Practice, 25, 35–41. doi:10.1111/j.1745-3992.2006.00066.x
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297–312. doi:10.1037/h0040950
Campbell, D. T. (1960). Recommendations for APA test standards regarding construct, trait, or discriminant validity. American Psychologist, 15, 546–553. doi:10.1037/h0048255
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016
Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Chicago, IL: Rand McNally.
Carver, R. P. (1974). Two dimensions of tests: Psychometric and edumetric. American Psychologist, 29, 512–518. doi:10.1037/h0036782
Cattell, R. B. (1964). Validity and reliability: A proposed more basic set of concepts. Journal of Educational Psychology, 55, 1–22. doi:10.1037/h0046462
Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justification of test use. Psychological Methods, 17, 31–43. doi:10.1037/a0026975
Cizek, G. J., Rosenberg, S. L., & Koons, H. H. (2008). Sources of validity evidence for educational and psychological tests. Educational and Psychological Measurement, 68, 397–412. doi:10.1177/0013164407310130
Cone, J. D. (1995). Assessment practice standards. In S. C. Hayes, V. M. Follette, R. M. Dawe, & K. Grady (Eds.), Scientific standards for psychological practice: Issues and recommendations (pp. 201–224). Reno, NV: Context Press.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin.
Cronbach, L. J. (1949). Essentials of psychological testing. New York, NY: Harper.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy (pp. 147–171). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (pp. 621–694). Washington, DC: American Council on Education.
Cureton, E. E. (1965). Reliability and validity: Basic assumptions and experimental designs. Educational and Psychological Measurement, 25, 327–346. doi:10.1177/001316446502500204
Dick, W., & Hagerty, N. (1971). Topics in measurement: Reliability and validity. New York, NY: McGraw-Hill.
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education, 37, 830–837. doi:10.1046/j.1365-2923.2003.01594.x
Dunnette, M. D. (1992). It was nice to be there: Construct validity then and now. Human Performance, 5, 157–169. doi:10.1207/s15327043hup0501&2_9
Dunnette, M. D., & Borman, W. C. (1979). Personnel selection and classification systems. Annual Review of Psychology, 30, 477–525. doi:10.1146/annurev.ps.30.020179.002401
Ebel, R. L. (1983). The practical validation of tests of ability. Educational Measurement: Issues and Practice, 2, 7–10. doi:10.1111/j.1745-3992.1983.tb00688.x
English, H., & English, A. A. (1958). Comprehensive dictionary of psychological and psychoanalytical terms. New York, NY: Longmans, Green.
Evers, A., Sijtsma, K., Lucassen, W., & Meijer, R. R. (2010). The Dutch review process for evaluating the quality of psychological tests: History, procedure, and results. International Journal of Testing, 10, 295–317. doi:10.1080/15305058.2010.518325
Fernberger, S. W. (1932). The American Psychological Association: A historical summary, 1892–1930. Psychological Bulletin, 29, 1–89. doi:10.1037/h0075733
Fitzpatrick, A. R. (1983). The meaning of content validity. Applied Psychological Measurement, 7, 3–13. doi:10.1177/014662168300700102
Foster, S. L., & Cone, J. D. (1995). Validity issues in clinical assessment. Psychological Assessment, 7, 248–260. doi:10.1037/1040-3590.7.3.248
Freebody, P., & Wyatt-Smith, C. (2004). The assessment of literacy: Working the zone between "system" and "site" validity. Journal of Educational Enquiry, 5, 30–49.
Frisbie, D. A. (2005). Measurement 101: Some fundamentals revisited. Educational Measurement: Issues and Practice, 24, 21–28. doi:10.1111/j.1745-3992.2005.00016.x
Gaylord, R. H., & Stunkel, E. R. (1954). Validity and the criterion. Educational and Psychological Measurement, 14, 294–300. doi:10.1177/001316445401400209
Greene, H. A., Jorgensen, A. N., & Gerberich, J. R. (1943). Measurement and evaluation in the secondary school. New York, NY: Longmans, Green.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427–438.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York, NY: McGraw-Hill.
Guion, R. M. (1977a). Content validity—The source of my discontent. Applied Psychological Measurement, 1, 1–10. doi:10.1177/014662167700100103
Guion, R. M. (1977b). Content validity: Three years of talk—What's the action? Public Personnel Management, 6, 407–414.
Guion, R. M. (1980). On Trinitarian doctrines of validity. Professional Psychology, 11, 385–398. doi:10.1037/0735-7028.11.3.385
Guion, R. M. (2009). Was this trip really necessary? Industrial and Organizational Psychology, 2, 465–468. doi:10.1111/j.1754-9434.2009.01174.x
Guion, R. M. (2011). Assessment, measurement, and prediction for personnel decisions (2nd ed.). Hove, England: Routledge.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511–517. doi:10.1037/h0054604
Guttman, L. (1950). The problem of attitude and opinion measurement. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II: Vol. 4. Measurement and prediction (pp. 46–59). Princeton, NJ: Princeton University Press.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2002). Cluster validity methods: Part 1. SIGMOD Record, 31, 40–45. doi:10.1145/565117.565124
Hambleton, R. K. (1980). Test score validity and standard-setting methods. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art (pp. 80–123). Baltimore, MD: Johns Hopkins University Press.
Hanlon, C., Medhin, G., Alem, A., Araya, M., Abdulahi, A., Hughes, M., . . . Prince, M. (2008). Detecting perinatal common mental disorders in Ethiopia: Validation of the Self-Reporting Questionnaire and Edinburgh Postnatal Depression Scale. Journal of Affective Disorders, 108, 251–262. doi:10.1016/j.jad.2007.10.023
Hill, H. C., Dean, C., & Gaffney, I. M. (2007). Assessing elemental and structural validity: Data from teachers, non-teachers, and mathematicians. Measurement: Interdisciplinary Research and Perspectives, 5, 81–92. doi:10.1080/15366360701486999
Hoffman, R. G., & Davis, G. L. (1995). Prospective validity study: CPI Work Orientation and Managerial Potential Scales. Educational and Psychological Measurement, 55, 881–890. doi:10.1177/0013164495055005024
Hogan, T. P., & Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64, 802–812. doi:10.1177/0013164404264120
Holtzman, N. A., & Watson, M. S. (Eds.). (1997). Promoting safe and effective genetic testing in the United States: Final report of the Task Force on Genetic Testing. Retrieved from http://www.genome.gov/10001733
Hopwood, C. J., Baker, K. L., & Morey, L. C. (2008). Extratest validity of selected Personality Assessment Inventory scales and indicators in an inpatient substance abuse setting. Journal of Personality Assessment, 90, 574–577. doi:10.1080/00223890802388533
Hubley, A. M., & Zumbo, B. D. (1996). A dialectic on validity: Where we have been and where we are going. Journal of General Psychology, 123, 207–215. doi:10.1080/00221309.1996.9921273
Jolliffe, D., Farrington, D. P., Hawkins, J. D., Catalano, R. F., Hill, K. G., & Kosterman, R. (2003). Predictive, concurrent, prospective and retrospective validity of self-reported delinquency. Criminal Behaviour and Mental Health, 13, 179–197. doi:10.1002/cbm.541
Jonson, J. L., & Plake, B. S. (1998). A historical comparison of validity standards and validity practices. Educational and Psychological Measurement, 58, 736–753. doi:10.1177/0013164498058005002
Julnes, G. (2011). Reframing validity in research and evaluation: A multidimensional, systematic model of valid inference. In H. T. Chen, S. I. Donaldson, & M. M. Mark (Eds.), Advancing validity in outcome evaluation: Theory and practice (pp. 55–67). Hoboken, NJ: Wiley.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342. doi:10.1111/j.1745-3984.2001.tb01130.x
Kane, M. (2004). The analysis of interpretive arguments: Some observations inspired by the comments. Measurement: Interdisciplinary Research and Perspectives, 2, 192–200. doi:10.1207/s15366359mea0203_3
Kane, M. (2008). Terminology, emphasis, and utility in validation. Educational Researcher, 37, 76–82. doi:10.3102/0013189X08315390
Kane, M. (2009). Validating the interpretations and uses of test scores. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 39–64). Charlotte, NC: Information Age.
Kane, M. (2012). All validity is construct validity. Or is it? Measurement: Interdisciplinary Research and Perspectives, 10, 66–70. doi:10.1080/15366367.2012.681977
Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50, 1–73. doi:10.1111/jedm.12000
Karelitz, T. M., Parrish, D. M., Yamada, H., & Wilson, M. (2010). Articulating assessments across childhood: The cross-age validity of the Desired Results Developmental Profile–Revised. Educational Assessment, 15, 1–26. doi:10.1080/10627191003673208
Kvale, S. (1995). The social construction of validity. Qualitative Inquiry, 1, 19–40. doi:10.1177/107780049500100103
Landy, F. J. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192. doi:10.1037/0003-066X.41.11.1183
Larsen, K. R., Nevo, D., & Rich, E. (2008). Exploring the semantic validity of questionnaire scales. In R. H. Sprague, Jr. (Ed.), Proceedings of the 41st Hawaii International Conference on System Sciences [CD]. Washington, DC: IEEE Computer Society. doi:10.1109/HICSS.2008.165
Lather, P. (1986). Issues of validity in openly ideological research: Between a rock and a hard place. Interchange, 17, 63–84. doi:10.1007/BF01807017
Lather, P. (1993). Fertile obsession: Validity after poststructuralism. The Sociological Quarterly, 34, 673–693.
Lawshe, C. H. (1952). Employee selection. Personnel Psychology, 5, 31–34. doi:10.1111/j.1744-6570.1952.tb00990.x
Lawshe, C. H. (1985). Inferences from personnel tests and their validity. Journal of Applied Psychology, 70, 237–238.
Lennon, R. T. (1956). Assumptions underlying the use of content validity. Educational and Psychological Measurement, 16, 294–304. doi:10.1177/001316445601600303
Lievens, F., Buyse, T., & Sackett, P. R. (2005). The operational validity of a video-based situational judgment test for medical college admission: Illustrating the importance of matching predictor and criterion construct domains. Journal of Applied Psychology, 90, 442–452. doi:10.1037/0021-9010.90.3.442
Lindquist, E. F. (1936). The theory of test construction. In H. E. Hawkes, E. F. Lindquist, & C. R. Mann (Eds.), The construction and use of achievement examinations: A manual for secondary school teachers (pp. 17–106). Cambridge, MA: Riverside Press.
Linn, R. L. (1978). Single-group validity, differential validity, and differential prediction. Journal of Applied Psychology, 63, 507–512. doi:10.1037/0021-9010.63.4.507
Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16, 14–16. doi:10.1111/j.1745-3992.1997.tb00587.x
Lissitz, R. W., & Samuelsen, K. (2007). A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36, 437–448. doi:10.3102/0013189X07311286
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(Suppl. 9), 635–694. doi:10.2466/pr0.1957.3.3.635
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
MacPhail, F. (1998). Moving beyond statistical validity in economics. Social Indicators Research, 45, 119–149. doi:10.1023/A:1006989612799
Maguire, T., Hattie, J., & Haig, B. (1994). Construct validity and achievement assessment. Alberta Journal of Educational Research, 40(2), 109–126.
Maraun, M. D., Slaney, K. L., & Gabriel, S. M. (2009). The Augustinian methodological family of psychology. New Ideas in Psychology, 27, 148–162. doi:10.1016/j.newideapsych.2008.04.011
Marcoulides, G. A. (2004). Conceptual debates in evaluating measurement procedures. Measurement: Interdisciplinary Research and Perspectives, 2, 182–184. doi:10.1207/s15366359mea0203_2
Markus, K. A. (2012). Constructs and attributes in test validity: Reflections on Newton's account. Measurement: Interdisciplinary Research and Perspectives, 10, 84–87. doi:10.1080/15366367.2012.677348
Markus, M. L., & Robey, D. (1980). The organizational validity of management information systems. Cambridge, MA: Massachusetts Institute of Technology, Center for Information Systems Research.
Maxwell, J. A. (1992). Understanding and validity in qualitative research. Harvard Educational Review, 62, 279–300.
McCrae, R. R. (1982). Consensual validation of personality traits: Evidence from self-reports and ratings. Journal of Personality and Social Psychology, 43, 293–303. doi:10.1037/0022-3514.43.2.293
McIntire, S. A., & Miller, L. A. (2007). Foundations of psychological testing: A practical approach (2nd ed.). Thousand Oaks, CA: Sage.
Mehrens, W. A. (1997). The consequences of consequential validity. Educational Measurement: Issues and Practice, 16, 16–18. doi:10.1111/j.1745-3992.1997.tb00588.x
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955–966. doi:10.1037/0003-066X.30.10.955
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027. doi:10.1037/0003-066X.35.11.1012
Messick, S. (1981). Evidence and ethics in the evaluation of tests. Educational Researcher, 10, 9–20. doi:10.3102/0013189X010009009
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33–48). Hillsdale, NJ: Erlbaum.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Washington, DC: American Council on Education.
Messick, S. (1992). Validity of test interpretation and use. In M. C. Alkin (Ed.), Encyclopedia of educational research (6th ed., Vol. 4, pp. 1487–1495). New York, NY: Macmillan.
Messick, S. (1998). Test validity: A matter of consequences. Social Indicators Research, 45, 35–44. doi:10.1023/A:1006964925094
Miller, M. D., Linn, R. L., & Gronlund, N. E. (2009). Measurement and assessment in teaching (10th ed.). Upper Saddle River, NJ: Pearson Education.
Mosier, C. I. (1947). A critical examination of the concepts of face validity. Educational and Psychological Measurement, 7, 191–205. doi:10.1177/001316444700700201
Moss, P. A. (1995). Themes and variations in validity theory. Educational Measurement: Issues and Practice, 14, 5–13. doi:10.1111/j.1745-3992.1995.tb00854.x
Murphy, K. R. (2009). Content validation is useful for many things, but validity isn't one of them. Industrial and Organizational Psychology, 2, 453–464. doi:10.1111/j.1754-9434.2009.01173.x
Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22, 287–293. doi:10.1111/j.1745-3984.1985.tb01065.x
Newton, P. E. (2012a). Clarifying the consensus definition of validity. Measurement: Interdisciplinary Research and Perspectives, 10, 1–29. doi:10.1080/15366367.2012.669666
Newton, P. E. (2012b). Questioning the consensus definition of validity. Measurement: Interdisciplinary Research and Perspectives, 10, 110–122. doi:10.1080/15366367.2012.688456
Pollitt, A. (2012). Validity cannot be created, it can only be lost. Measurement: Interdisciplinary Research and Perspectives, 10, 100–103. doi:10.1080/15366367.2012.686868
Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Popham, W. J. (1997). Consequential validity: Right concern—wrong concept. Educational Measurement: Issues and Practice, 16, 9–13. doi:10.1111/j.1745-3992.1997.tb00586.x
Reynolds, C. R., Livingston, R. B., & Willson, V. (2010). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson.
Richardson, M. W. (1936). The relation between the difficulty and the differential validity of a test. Psychometrika, 1, 33–49. doi:10.1007/BF02288003
Rosenberg, M. (1979). Conceiving the self. New York, NY: Basic Books.
Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290–296.
Scriven, M. (2002). Assessing six assumptions in assessment. In H. I. Braun, D. N. Jackson, & D. E. Wiley (Eds.), The role of constructs in psychological and educational measurement (pp. 255–275). Mahwah, NJ: Erlbaum.
Sechrest, L. (1963). Incremental validity: A recommendation. Educational and Psychological Measurement, 23, 153–158. doi:10.1177/001316446302300113
Shaw, D. J., & Linden, J. D. (1964). A critique of the Hand Test. Educational and Psychological Measurement, 24, 283–284. doi:10.1177/001316446402400209
Shaw, S., & Weir, C. J. (2007). Examining writing: Research and practice in assessing second language writing. Cambridge, England: Cambridge University Press.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450.
Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16, 5–24. doi:10.1111/j.1745-3992.1997.tb00585.x
Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83–117. doi:10.1023/A:1006985528729
Sireci, S. G. (2007). On validity theory and test validation. Educational Researcher, 36, 477–481. doi:10.3102/0013189X07311609
Sireci, S. G. (2009). Packing and unpacking sources of validity evidence: History repeats itself again. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 19–37). Charlotte, NC: Information Age.
Smith, G. T. (2005). On construct validity: Issues of method and measurement. Psychological Assessment, 17, 396–408. doi:10.1037/1040-3590.17.4.396
Tenopyr, M. L. (1986). Needed directions for measurement in work settings. In J. V. Mitchell, Jr. (Series Ed.) & B. S. Plake & J. C. Witt (Vol. Eds.), Buros-Nebraska Symposium on Measurement and Testing: Vol. 2. The future of testing (pp. 269–288). Hillsdale, NJ: Erlbaum.
Thurstone, L. L. (1931). The reliability and validity of tests: Derivation and interpretation of fundamental formulae concerned with reliability and validity of tests and illustrative problems. Ann Arbor, MI: Edwards. doi:10.1037/11418-000
Trochim, W. M. (2006). The research methods knowledge base (2nd ed.). Retrieved from http://www.socialresearchmethods.net/kb/
Tryon, R. C. (1957a). Communality of a variable: Formulation by cluster analysis. Psychometrika, 22, 241–260. doi:10.1007/BF02289125
Tryon, R. C. (1957b). Reliability and behavior domain validity: Reformulation and historical critique. Psychological Bulletin, 54, 229–249. doi:10.1037/h0047980
Waluchow, W. J. (2009). Four concepts of validity: Reflections on inclusive and exclusive positivism. In M. D. Adler & K. E. Himma (Eds.), The rule of recognition and the United States Constitution (pp. 123–143). Oxford, England: Oxford University Press. doi:10.1093/acprof:oso/9780195343298.003.0005
Watson, G., & Forlano, G. (1935). Prima facie validity in character tests. Journal of Educational Psychology, 26, 1–16. doi:10.1037/h0057103
Willcutt, E. G., & Carlson, C. L. (2005). The diagnostic validity of attention-deficit/hyperactivity disorder. Clinical Neuroscience Research, 5, 219–232. doi:10.1016/j.cnr.2005.09.003
Wolming, S., & Wikstrom, C. (2010). The concept of validity in theory and practice. Assessment in Education: Principles, Policy and Practice, 17, 117–132.
Woody, C. (1935). A symposium on the effects of measurement on instruction. Journal of Educational Research, 28, 481–483.
Yalow, E., & Popham, W. J. (1983). Content validity at the crossroads. Educational Researcher, 12, 10–21. doi:10.3102/0013189X012008010
Zumbo, B. D. (2009). Validity as contextualized and pragmatic explanation, and its implications for validation practice. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 65–82). Charlotte, NC: Information Age.

Received May 22, 2012
Revision received March 26, 2013
Accepted April 7, 2013