Empirical Validation of Criterion-Referenced Tests

GERALD TINDAL, LYNN S. FUCHS, DOUGLAS FUCHS, MARK R. SHINN, STANLEY L. DENO, GARY GERMANN
University of Minnesota

ABSTRACT. This study examined the test-retest reliability and criterion validity of basal mastery tests of three widely used commercial reading series. Traditional correlational analyses as well as statistical procedures developed specifically for criterion-referenced tests were employed. Results indicated that the reliability and validity of the basal mastery tests varied among and within instruments. Additional analyses revealed a strong relation between traditional and criterion-referenced strategies for describing test-retest reliability, but a more modest relation between traditional and alternative analyses of criterion validity. Implications for developing and using basal mastery tests and describing their adequacy are discussed.

At the time data were collected, the authors were affiliated with the University of Minnesota Institute for Research on Learning Disabilities. Currently, Gerald Tindal and Gary Germann are with the Pine County Public School Cooperative, Lynn S. Fuchs is at Wheelock College, Douglas Fuchs is at Clark University, Mark R. Shinn is with the Minneapolis Public Schools, and Stanley L. Deno is at the University of Minnesota. Address correspondence to Douglas Fuchs, Department of Education, Clark University, Worcester, MA 01610.

Norm-referenced achievement testing is the traditional and predominant measurement strategy for evaluating and documenting educational program effects. However, there is growing recognition that norm-referenced measurement may be inadequate for this purpose: it has poor curricular validity (McClung, cited in Yalow & Popham, 1983), and it fails to indicate the extent to which specific educational objectives have been mastered (Skager, 1971). As an alternative to traditional measurement, criterion-referenced testing has received increasing attention over the past two decades from measurement theorists, test developers, and school personnel. As defined by Glaser and Nitko (1971), the criterion-referenced test is a sample of items yielding information that is interpretable directly with respect both to a well-defined domain of tasks and to specified performance standards. This definition reflects three characteristics that frequently are employed in the literature to describe criterion-referenced measurement: (1) definition of a well-specified content domain (Baker, 1974; Hambleton & Novick, 1973; Millman, 1972); (2) delineation of valid performance criteria (Hambleton, 1980); and (3) development of procedures for generating appropriate samples of tests (Goodstein, 1982; Hambleton, Swaminathan, Algina, & Coulson, 1978; Popham, 1980). All three components stress the edumetric and psychometric properties of criterion-referenced tests.

Nevertheless, the focus in publishing houses and schools has been only on the practical implications of the criterion-referenced test. With the recognition that it can provide relevant data for describing student progress with respect to specific learning objectives, criterion-referenced test use has proliferated. Test developers have marketed these instruments along with objective banks, and teachers have developed such instruments to fit their own individual learning objectives. Unfortunately, investigation of the reliability and validity of these measures has not kept pace with their development and use. Even among published criterion-referenced tests, there is scant empirical support for technical adequacy. Inspection of 12 commercial criterion-referenced tests revealed that only 4 test manuals addressed reliability and validity at all, and authors of only 2 instruments investigated more than one aspect of test adequacy (Tindal, Shinn, Fuchs, Fuchs, Deno, & Germann, 1983). Similarly, within the area of reading instruction, publishers have developed criterion-referenced tools for assessing mastery within their basal series. Such basal mastery tests are used widely in the schools for determining when to progress students through curricula. Although these tests frequently are isomorphic with respect to reading curricula and, as such, appear to be useful, there is little evidence that such measurement is accurate or meaningful. Therefore, there is a need for examination of the psychometric properties of these tests.

Traditional ways of investigating reliability and validity have been criticized as inappropriate for mastery instruments (Popham & Husek, 1969). Because homogeneous distributions of test scores should be centered at the low and high ends of the measurement scale, respectively representing pre- and post-instruction performance, variance of test scores probably is restricted, and correlational estimates of reliability and validity should tend to be low (Hambleton & Novick, 1973); a brief simulation illustrating this attenuation appears at the end of this introduction. Therefore, alternative analyses for investigating the adequacy of mastery tests have been developed (Berk, 1980). In contrast to the correlation statistic, these analyses rely minimally on the notion that inter-individual variability is necessary (Carver, 1970; Hambleton & Novick, 1973; Huynh, 1976; Subkoviak, 1975). Nevertheless, despite the development of such analyses, constructors of commercial mastery tests, when they do address technical adequacy, still rely on traditional correlational analyses (Tindal et al., 1983).

The purposes of this study were twofold. First, the investigation was designed to describe the reliability and validity of three specific basal mastery tests, each from a widely used basal reading series: Houghton Mifflin, Ginn 720, and Scott, Foresman. Despite widespread use of these basal series, there are few if any reports concerning the adequacy of their criterion-referenced tests. (None of the three basal mastery tests is accompanied by reliability or validity data. Additionally, despite the authors' attempts through correspondence to obtain relevant information, none of the publishing houses provided meaningful data concerning anything more than face or curricular validity.) The investigation of the tests' reliability and validity should provide information of interest not only to users of these measures but also to users of other basal mastery tests for which technical data are still unavailable. The second purpose of the study was to compare results based on traditional and alternative approaches to studying the technical adequacy of criterion-referenced instruments. Such a comparison should shed light on the appropriateness and potential usefulness of each strategy.
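The restriction-of-range argument above can be illustrated with a minimal simulation. This sketch is not part of the original study; the distributions, error terms, and cut point are all hypothetical.

```python
# A minimal simulation sketch (all values hypothetical) illustrating the
# attenuation argument: restricting score variance lowers correlational
# estimates even when the measures themselves are sound.
import numpy as np

rng = np.random.default_rng(0)

# 1,000 simulated examinees; two tests share a common "true" proficiency
# and correlate about .80 over the full range of scores.
true_score = rng.normal(100, 15, size=1000)
test_a = true_score + rng.normal(0, 7.5, size=1000)
test_b = true_score + rng.normal(0, 7.5, size=1000)
full_r = np.corrcoef(test_a, test_b)[0, 1]

# Post-instruction mastery testing concentrates scores near the top of
# the scale; keep only examinees above the 75th percentile on test A.
mask = test_a > np.percentile(test_a, 75)
restricted_r = np.corrcoef(test_a[mask], test_b[mask])[0, 1]

print(f"full-range r = {full_r:.2f}; restricted r = {restricted_r:.2f}")
# The restricted-range correlation is markedly lower, although the tests
# themselves are unchanged.
```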

Method

Subjects. There were three pools of subjects, one for each basal mastery test examined. In each case, subjects were intermediate-age students in a rural midwestern educational cooperative. For the Houghton-Mifflin test, subjects were 47 students (20 M, 27 F) who constituted two sixth-grade classrooms in two school districts. Their mean percentile rank on the Science Research Associates Reading Achievement Test (SRA) was 51.5 (SD = 18.1).


For the Ginn 720 test, subjects were 47 students (27 M, 20 F), representing two fifth-grade classes in two school districts. They scored a mean percentile rank of 45.1 (SD = 17.8) on the SRA. Subjects included for the Scott, Foresman test were 25 fourth graders (13 M, 12 F) from one class. Their mean percentile rank on the SRA was 65.0 (SD = 19.3).

Measures. Three types of reading performance measures were used in the study: basal mastery tests, a global norm-referenced test, and a curriculum-based word reading test.

Basal mastery tests. Three basal mastery tests were employed. The End-of-Level 11 Basic Reading Test (Brzeinski & Schoephoerster, 1974) of the Houghton-Mifflin reading series comprises three scales: decoding skills, comprehension skills, and reference/study skills. Each scale includes two to three subtests, each of which encompasses 6 to 12 items, with a mastery cutoff score between 83% and 85% correct responses. (See Tindal et al., 1983, for descriptions of the subtests.) Four scales of the End-of-Level 11 Mastery Test (Clymer, Blanton, Johnson, & Lapp, 1980) of the Ginn 720 and Ginn 720 Rainbow Edition reading series were employed. Each of these scales (comprehension, vocabulary, decoding, and study skills) comprises two subtests, with 6 to 25 items per subtest. Mastery is set between 79% and 85% correct responses. (See Fuchs, Tindal, Shinn, Fuchs, Deno, & Germann, 1983, for descriptions of the subtests.) The End-of-Book 10 Test (Johns, 1981) of the Scott, Foresman reading series encompasses four scales (word identification, comprehension, study and research, and literary understanding and appreciation), each of which comprises subtests. This test includes between 12 and 43 items per scale. Mastery scores are established for the four scales only and range from 79% to 83% correct responses. For this study, two subtests (fiction/nonfiction, from the comprehension scale, and summarizing, from the study and research scale) were omitted for logistical reasons, resulting in 12 to 41 items per scale with mastery scores between 76% and 83% correct responses. (See Tindal, Fuchs, Fuchs, Shinn, Deno, & Germann, 1983, for descriptions of the subtests.)

Norm-referenced test. The SRA (Naslund, Thorpe, & Lefever, 1978) comprises two subtests: vocabulary and comprehension. In the vocabulary section, examinees are required to select, from four alternatives, a synonym for an underlined word in a sentence. In the comprehension section, examinees read 200- to 300-word passages and answer questions in multiple-choice format. The total test score is based on a linear combination of the two subtests. Internal consistency reliability was reported at .88 (Salvia & Ysseldyke, 1981).

Curriculum-based word reading test. The Word Reading Test (Deno, Mirkin, & Chiang, 1982) requires children to read aloud passages and isolated word lists, each for a one-minute period. It is scored in terms of average numbers of words correct over two alternate forms of the isolated word reading and passage reading scales.


The 200-word passages are drawn randomly from a student's grade-appropriate basal reading book; the 150-word lists sample words randomly from the basals, with 60% of the words drawn from the student's grade-appropriate level and 40% sampled equally from all previous levels. For the passage and isolated Word Reading Test scales, test-retest and alternate-form reliabilities were at least .90 (Fuchs, Deno, & Marston, 1983; Fuchs, Deno, & Mirkin, 1984; Fuchs, Wesson, Tindal, Mirkin, & Deno, 1981).

Procedure. All students were tested in class groups, by school psychologists for the SRA and by their classroom teachers on the basal mastery tests. The Word Reading Test was administered individually by trained aides. Standardized administration procedures were followed on all tests. Testing time ranged from 60 to 90 minutes for the SRA, 60 to 90 minutes for each basal mastery test, and 5 to 6 minutes for the Word Reading Test. All testing was completed within two weeks. For the Houghton-Mifflin Test (HMT), a subgroup of 20 students (11 M, 9 F) was administered the measures in the following order: HMT, SRA, Word Reading Test, and HMT again. The remaining 27 students were given each test once, with the order of administration random. Similarly, for the Ginn Test (GT), a subgroup of 22 students (12 M, 10 F) was administered the GT, the SRA, the Word Reading Test, and the GT again. The remaining 23 students were given each test once in random order. For the Scott, Foresman Test (SFT), all subjects were administered the following tests in the following order: the SFT, the SRA, the Word Reading Test, and the SFT again.

Data Analysis

Consistency of performance on two administrations of the same test. Consistency of student performance on each basal mastery test was assessed in two ways. In each analysis, the sets of students who had been tested twice on the basal mastery tests were the subjects. First, traditional test-retest reliability was determined by correlating scores from two administrations of the same test. The second analysis was designed specifically for criterion-referenced measures (see Millman, 1974), and was calculated for HMT and GT subtest and scale scores and for SFT scale scores. The difference between observed and chance proportions of agreements in mastery decisions between two administrations of the same test was divided by the maximum value that difference could assume. The chance proportion of agreements was computed by multiplying and then summing the marginal proportions of the same decision categories for the two administrations, as done in a chi-square test of association. This ratio (where the numerator is the observed minus the chance proportion of agreements and the denominator is the maximum value that such a difference can assume) can be interpreted as the proportion of the total number of possible agreements that actually was achieved above the chance level. The ratio hereafter is referred to as the "corrected proportion."
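A computational sketch of the corrected proportion may be helpful. Under a reading where the maximum attainable agreement is taken to be 1.0, the statistic matches Cohen's kappa computed on mastery/nonmastery decisions; the function name and the data below are hypothetical.

```python
# Sketch of the "corrected proportion" described above, computed on
# mastery (1) / nonmastery (0) decisions from two administrations of the
# same test. With maximum agreement taken as 1.0, this is Cohen's kappa.
def corrected_proportion(first, second):
    n = len(first)
    # Observed proportion of examinees placed in the same category twice.
    p_observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: multiply and sum the marginal proportions of the
    # same decision categories, as in a chi-square test of association.
    p_mastery_1 = sum(first) / n
    p_mastery_2 = sum(second) / n
    p_chance = (p_mastery_1 * p_mastery_2
                + (1 - p_mastery_1) * (1 - p_mastery_2))
    # Proportion of possible above-chance agreement actually achieved.
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical decisions for ten examinees tested twice:
admin_1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
admin_2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1]
print(round(corrected_proportion(admin_1, admin_2), 2))  # 0.58
```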

Criterion validity. The criterion validity of each basal mastery test was determined in two ways, each using the entire group of subjects for each series. The traditional psychometric strategy of correlating scores on the measure of interest (here, each basal mastery test) with each criterion measure was used. The SRA and the Word Reading Test, the more global tests of reading proficiency, were employed as the criterion measures. In the alternative approach to criterion validity (see Millman, 1974), chi-square statistical tests were applied to contingency tables wherein mastery/nonmastery decisions for each basal test represented one dimension of each table, and pre-post instructional status within the corresponding series represented the other dimension. Pre-post instructional status indicates whether the student already had received instruction in the text corresponding to the basal mastery test. Percentages of misclassifications and phi coefficients supplemented the chi-square tests (a computational sketch of these statistics appears below).

Relations between traditional and alternative analyses. Finally, to examine the relation between the traditional psychometric and alternative criterion-referenced approaches to investigating the technical aspects of criterion-referenced tests, correlations were run. First, the test-retest reliability coefficients were correlated with the corrected proportions of examinees placed into the same decision category on two administrations of the same test. Second, the mean criterion validity correlation coefficients between scores on the basal mastery test and the SRA and Word Reading Test scores were correlated with the phi coefficients that were based on the relation between mastery/nonmastery decisions and pre-post instructional status.
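The contingency-table statistics can likewise be sketched. The 2 x 2 counts below are hypothetical, and the chi-square shown is the uncorrected Pearson statistic.

```python
# Sketch of the alternative criterion-validity analysis: mastery decisions
# crossed with pre-/post-instructional status. All counts are hypothetical.
import math

#              nonmastery  mastery
pre_counts  = (15, 5)    # students not yet instructed in the text
post_counts = (4, 23)    # students already instructed in the text

a, b = pre_counts
c, d = post_counts
n = a + b + c + d

# Phi coefficient for a 2 x 2 table.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Pearson chi-square for a 2 x 2 table equals n * phi**2.
chi_square = n * phi ** 2

# Misclassifications: pre-instruction students judged masters plus
# post-instruction students judged nonmasters.
pct_misclassified = 100 * (b + c) / n

print(f"chi-square = {chi_square:.1f}, phi = {phi:.2f}, "
      f"misclassified = {pct_misclassified:.0f}%")
```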

Results

Performance consistency across administrations of the same test. Test-retest reliability coefficients (the traditional analysis) and corrected proportions of examinees placed into the same decision categories on two administrations of each basal mastery test (the criterion-referenced analysis) are displayed in Table 1. For the HMT, test-retest reliability coefficients were an average .28 on the decoding subtests and scale, an average .45 for the comprehension subtests and scale, and an average .92 for the study/reference skills subtests and scale. For the total HMT score, test-retest reliability was .90. Corrected proportions of consistently placed examinees were similar, with an average proportion of .06 for the decoding scale, an average proportion of .34 for the comprehension scale, and an average proportion of .70 for the study/reference skills scale. For the GT, correlations and corrected proportions of consistently placed students typically were higher. For the comprehension, vocabulary, decoding, and study skills scales, respectively, mean test-retest reliability coefficients were .91, .93, .88, and .61, and mean corrected proportions of examinees for related subtests were .84, .73, .94, and .45. For the total GT score, test-retest reliability was .97.

Table 1. Test-Retest Reliability Coefficients and Corrected Proportions of Examinees in the Same Decision Categories on Two Administrations of Each Basal Mastery Test

Basal Mastery Test                       N    Reliability Coefficient   Corrected Proportion
Houghton-Mifflin total                   20   .90**
  Decoding                                    .21
    word attack                               .42                       -.06
    pronunciation                             .20                        .18
  Comprehension                               .72
    literal                                   .61                        .57
    interpretative thinking                   .03                       -.15
    meaning acquisition                       .83*                       .62
  Study/reference skills                      .94**
    information locating                      .94**                      .64
    information appraising                    .86*                       .69
    information organizing                    .93**                      .78
Ginn total                               22   .97**
  Comprehension                               .93**
    literal                                   .93**                      .77
    inferential                               .86*                       .91
  Vocabulary                                  .97**
    word meaning                              .91**                      .65
    context clues                             .90**                      .81
  Decoding                                    .90**
    prefixes                                  .90**                     1.00
    suffixes                                  .84*                       .88
  Study skills                                .69
    re-spellings/accents                      .49                        .29
    parts of an outline                       .64                        .61
Scott, Foresman total                    25   .98**
  word identification                         .93**                      .96
  comprehension                               .92**                      .92
  study and research                          .93**                      .84
  literary understanding/appreciation         .68                        .57

*Acceptable for group decision making. **Acceptable for individual decision making.

Figures for scale scores only are appropriate for the SFT, since mastery decisions are calculated only on scales. For the word identification, comprehension, study and research, and literary understanding/appreciation scales, respectively, correlation coefficients were .93, .92, .93, and .68, and corrected proportions were .96, .92, .84, and .57. Across the total SFT score, test-retest reliability was .98.

Criterion validity. Correlations between the basal mastery tests and the SRA and Word Reading Test scores are shown in Table 2. Table 3 displays chi-square and related statistics, indicating the relation between mastery/nonmastery decisions on the basal mastery tests and pre-post instructional status in the corresponding reading series. For the HMT, mean correlations with the SRA were .44, .50, and .62 for the decoding, comprehension, and study and reference skills subtests and scales, and .68 for the total test score. With the Word Reading Test, mean correlations were .37, .51, .57, and .61, respectively. The mean phi coefficients were .25, .24, .50, and .36, respectively. For the GT comprehension, vocabulary, decoding, and study skills subtests and scales, respectively, mean correlations with the SRA were .58, .57, .49, and .53. The mean correlation between the total GT and SRA scores was .68.

With the Word Reading Test, mean correlations were .61 for comprehension, .79 for vocabulary, .59 for decoding, .48 for study skills, and .82 for the total score. Mean phi coefficients for comprehension, vocabulary, decoding, and study skills, respectively, were .69, .72, .54, and .51. The phi coefficient for the total GT was .74. Mean correlations between the SFT scales and the SRA ranged from .54 for literary understanding/appreciation to .89 for study and research. Between the SFT scales and the Word Reading Test, correlations ranged from .56 for word identification to .75 for study and research. Phi coefficients fell between .15 for literary understanding/appreciation and .65 for study and research. On the total SFT, mean correlations were .94 and .81 for the SRA and Word Reading Test, respectively. The phi coefficient was .65.

Approaches to describing criterion-referenced adequacy. The relation between traditional and alternative descriptions of criterion-referenced test adequacy was investigated by correlating the indices generated by traditional analyses with those generated by alternative procedures. The correlation between test-retest reliability coefficients and corrected proportions of examinees was .90. For subtests, scales, and total test scores with test-retest reliabilities of at least .90 (the criterion for making decisions about individuals; Salvia & Ysseldyke, 1981), corrected proportions of examinees ranged from .65 to 1.00. For measures with correlation coefficients of at least .80 (the criterion for making decisions about groups; Salvia & Ysseldyke, 1981), corrected proportions were between .62 and .91. For indices with correlations below .80, corrected proportions ranged from -.15 to .61. The correlation was .56 between (a) mean correlations for each subtest, scale, and total score calculated across the five criterion measures (SRA vocabulary, comprehension, and total, and Word Reading Test isolated words and passages); and (b) phi coefficients indicating the relation between mastery/nonmastery test classification and pre-post instructional status. For average correlations above .70, phi coefficients were from .32 to .77, with a median coefficient of .69; for mean correlations below .70, phi coefficients ranged from .13 to .72, with a median coefficient of .40.


Table 2. Correlations Between Basal Mastery Tests and SRA and Word Reading Test Scores

                                              SRA Test                          Word Reading Test
Basal Mastery Test                       N    Vocabulary  Comprehension  Total  Isolated Word  Passage   Mean
Houghton-Mifflin total                   47   .69         .66            .68    .57            .65       .65
  Decoding                                    .48         .49            .49    .36            .47       .46
    word attack                               .40         .38            .40    .27            .31       .35
    pronunciation                             .42         .49            .43    .33            .45       .42
  Comprehension                               .70         .64            .69    .55            .66       .65
    literal                                   .52         .61            .57    .41            .50       .52
    interpretative thinking                   .35         .19            .26    .33            .37       .30
    meaning acquisition                       .73         .70            .75    .55            .67       .68
  Study/reference skills                      .69         .63            .65    .57            .65       .64
    information locating                      .67         .63            .65    .53            .64       .62
    information appraising                    .58         .55            .53    .48            .59       .55
    information organizing                    .54         .47            .51    .52            .57       .52
Ginn total                               47   .55         .78            .72    .80            .83       .74
  Comprehension                               .53         .70            .73    .69            .72       .67
    literal                                   .38         .52            .58    .37            .45       .46
    inferential                               .50         .65            .66    .71            .72       .65
  Vocabulary                                  .47         .70            .64    .81            .85       .69
    word meaning                              .48         .73            .63    .82            .84       .70
    context clues                             .38         .52            .57    .67            .74       .58
  Decoding                                    .44         .65            .51    .65            .65       .58
    prefixes                                  .30         .48            .36    .44            .44       .40
    suffixes                                  .47         .66            .52    .68            .69       .60
  Study skills                                .48         .69            .64    .52            .58       .58
    re-spellings/accents                      .41         .54            .48    .31            .33       .41
    parts of an outline                       .39         .60            .55    .52            .59       .53
Scott, Foresman total                    25   .95         .92            .94    .77            .84       .88
  word identification                         .62         .57            .62    .42            .70       .59
  comprehension                               .86         .80            .86    .52            .70       .75
  study and research                          .89         .90            .87    .73            .76       .83
  literary understanding/appreciation         .55         .48            .59    .58            .55       .55

Discussion

The purpose of this study was twofold. First, it was designed to describe the reliability and validity of three criterion-referenced basal mastery tests. Second, by examining the technical adequacy of these tests with both traditional correlational analyses and alternative strategies developed specifically for criterion-referenced instruments, this investigation sought to assess the appropriateness and potential usefulness of each strategy. For the Houghton-Mifflin End-of-Level 11 Mastery Test (HMT), the Ginn 720 End-of-Level 11 Mastery Test (GT), and the Scott, Foresman End-of-Book 10 Test (SFT), this study addressed two aspects of technical adequacy: the consistency of student performance on two administrations of each test, and the criterion validity of each test.

The test-retest consistency analyses indicated that the HMT was less than adequate. Whereas the test-retest correlations and corrected proportions for the total test and the study/reference skills scale were high, indices for the decoding and comprehension scales, which teachers may consider critical for formulating decisions about reading proficiency, were unacceptably low. Results indicated that test-retest consistency was more acceptable for the GT and SFT, with average figures falling below .70 for only the study skills and the literary understanding/appreciation scales. Reliability was most consistently high for the SFT. This may be because the SFT limits computation of mastery scores to the test scales and total test, and thereby summarizes information across relatively large samples of behavior. As the Spearman-Brown formula indicates, when the number of test items increases, reliability improves correspondingly.
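The formula itself is a standard psychometric result (not reproduced in the original article): if a test with reliability $\rho$ is lengthened by a factor $k$ with comparable items, the projected reliability is

$$\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}.$$

For example, tripling the length of a subtest with reliability .60 projects a reliability of $3(.60)/[1 + 2(.60)] \approx .82$.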


Table 3. Relation Between Basal Mastery Test Classification and Instructional Status Classification

Basal Mastery Test                       N    Chi-Square   Phi   Percentage Misclassified
Houghton-Mifflin total                   47    5.8         .36   33
  Decoding                                     5.1         .33   32
    word attack                                2.3         .22   43
    pronunciation                              1.8         .20   38
  Comprehension                                5.1         .33   32
    literal                                    1.5         .18   40
    interpretative thinking                    0.8         .13   43
    meaning acquisition                        4.6         .32   34
  Study/reference skills                      11.7         .50   23
    information locating                       5.4         .34   32
    information appraising                    20.7         .67   15
    information organizing                    11.5         .50   23
Ginn total                               47   24.9         .74   15
  Comprehension                               24.4         .73   15
    literal                                   17.1         .61   22
    inferential                               23.9         .72   20
  Vocabulary                                  23.9         .72   20
    word meaning                              27.4         .77   20
    context clues                             20.4         .67   33
  Decoding                                    18.9         .64   37
    prefixes                                  12.8         .29   46
    suffixes                                  22.0         .69   35
  Study skills                                13.0         .53   22
    re-spellings/accents                      13.8         .55   24
    parts of an outline                        9.8         .46   26
Scott, Foresman total                    25    9.8         .65   22
  word identification                          0.7         .17   48
  comprehension                                6.9         .55   26
  study and research                           9.8         .65   22
  literary understanding/appreciation          0.5         .15   35

With respect to the criterion validity aspects of test adequacy, results were less clear. Traditional criterion validity coefficients between basal mastery tests and other tests of reading achievement typically fell in the .50s for the HMT and the GT. Previous investigations (Deno, Mirkin, & Chiang, 1982; Fuchs & Deno, 1981) of the relation between curriculum-based and more global reading achievement tests have documented coefficients above .70; therefore, the HMT and GT coefficients appear to be less than adequate. Traditional criterion validity analyses suggested that the SFT, with coefficients in the .70s, was superior to the other basal mastery tests. However, the alternative criterion-referenced strategy of investigating the relation between the basal tests' mastery classification and actual instructional status classification indicated that the GT was superior to the HMT and SFT; it tended to misclassify a relatively low percentage of children. Percentages of students misclassified nevertheless varied greatly across subtests and scales within each basal mastery test, at times rising above 40. Additionally, it should be noted that some misclassification is due to incorrect instructional classification by teachers rather than to incorrect test classification.

Consequently, this study documents high reliability for the total test scores of the basal mastery tests, indicating that teachers might have confidence in the accuracy of total test scores and rely on them for the purpose of deciding when to advance students through reading curricula. Nevertheless, reliability for subtests varied among and within basal mastery tests, suggesting that teachers' frequent use of basal mastery tests for describing skill deficiencies may be invalid. It may well be that subtest analyses should be eliminated, as done in the SFT. Additionally, criterion validity coefficients of the basal mastery tests appear to be sufficiently low to raise serious questions concerning their adequacy as achievement measures. While users of basal mastery tests might employ results of this study to help interpret basal mastery test scores, a fundamental conclusion of this study is that educators should exercise caution when using basal mastery tests for which psychometric data are unavailable. Although such tests may possess high curricular and face validity, their meaningfulness and accuracy remain empirical questions, an issue frequently ignored by basal mastery test developers and users. By exploring the reliability and validity of three such tests, this study underscores the importance (a) of the notion that curricular validity is a necessary, but insufficient, aspect of criterion-referenced test adequacy; and (b) of investigating the reliability and validity of each criterion-referenced test as it is developed.

The second purpose of this study was to compare the appropriateness and usefulness of traditional psychometric analyses with strategies developed for criterion-referenced tests.


Both reliability analyses addressed test-retest stability, and the results corroborated each other. The correlation between the traditional test-retest coefficients and the corrected proportions of examinees placed into the same decision category was .90. The widely accepted cut-off point of .80 for making decisions about groups consistently corresponded to a cut-off point of .62 for the corrected proportions. Yet these reliability analyses provided qualitatively different information, with the traditional analyses focusing on the consistency of test scores and the alternative analysis focusing on the consistency of mastery/nonmastery decisions. Therefore, results of this study suggest that traditional and alternative reliability analyses yield complementary information to describe test adequacy. Additionally, findings indicate that a corrected proportion of approximately .60 may provide a tentative cut-off point for acceptable criterion-referenced reliability levels. Given the use of such a standard, alternative and traditional reliability analyses yield similar decisions concerning the adequacy of criterion-referenced tests.

For the criterion validity analyses, results were less clear. The correlation between the traditional and alternative sets of statistics was a more modest .56. Furthermore, given an arbitrary cut-off for acceptable traditional criterion validity of .70, there was considerable overlap in corresponding phi coefficients. This inconsistency between the two analyses may be due to the fact that, whereas the traditional analysis focused on concurrent validity among test scores, the alternative, criterion-referenced analysis examined the relation between basal test mastery decisions and a truer criterion: pre-post instructional status, which indicates whether or not students actually had received instruction in the textbook corresponding to the test.

Finally, the findings presented here must be interpreted with caution because of the relatively small numbers of subjects employed for each basal mastery test. Nevertheless, this study appears to represent an important step in describing the reliability and validity of basal mastery tests, and in exploring procedures for arriving at acceptable descriptions of, and decisions about, criterion-referenced test adequacy.

REFERENCES

Baker, E. L. (1974). Beyond objectives: Domain-referenced tests for evaluation and instructional improvement. Educational Technology, 14, 10-16.

Berk, R. A. (1980). A consumer's guide to criterion-referenced test reliability. Journal of Educational Measurement, 17(4), 323-349.

Brzeinski, J., & Schoephoerster, H. (1974). Basic reading tests for Images. Boston: Houghton-Mifflin.

Carver, R. P. (1970). Special problems in measuring change with psychometric devices. In Evaluation research: Strategies and methods. Pittsburgh: American Institute for Research.


Clymer, T., Blanton, W. E., Johnson, D. D., & Lapp, D. (1980). Mastery test: Tell me how the sun rose, Reading 720, Level 11. Lexington, MA: Ginn & Co.

Deno, S. L., Mirkin, P. K., & Chiang, B. (1982). Identifying valid measures of reading. Exceptional Children, 49(1), 36-45.

Fuchs, L. S., Deno, S. L., & Marston, D. (1983). Improving the reliability of curriculum-based measures of academic skills for psychoeducational decision making. Diagnostique, 8(3), 135-149.

Fuchs, L. S., Deno, S. L., & Mirkin, P. K. (1984). The effects of frequent curriculum-based measurement and evaluation on pedagogy, student achievement, and student awareness of learning. American Educational Research Journal, 21(2), 449-460.

Fuchs, L. S., Tindal, G., Shinn, M. R., Fuchs, D., Deno, S. L., & Germann, G. (1983). The technical adequacy of a basal reading mastery test: The Ginn 720 series (Research Report No. 122). Minneapolis: University of Minnesota, Institute for Research on Learning Disabilities. (ERIC Document Reproduction Service No. ED 236 195)

Fuchs, L. S., Wesson, C., Tindal, G., Mirkin, P. K., & Deno, S. L. (1981). Teacher efficiency in continuous evaluation of IEP goals (Research Report No. 53). Minneapolis: University of Minnesota, Institute for Research on Learning Disabilities. (ERIC Document Reproduction Service No. ED 215 467)

Glaser, R., & Nitko, A. J. (1971). Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, DC: American Council on Education.

Goodstein, H. A. (1982). The reliability of criterion-referenced tests and special education: Assumed versus demonstrated. Journal of Special Education, 16(1), 37-48.

Hambleton, R. K. (1980). Test score validity. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore: The Johns Hopkins University Press.

Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10(3), 159-170.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48(1), 1-47.

Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253-264.

Johns, J. (1981). Scott, Foresman reading: Sea Treasures end-of-book test. Glenview, IL: Scott, Foresman.

Millman, J. (1974). Criterion-referenced measurement. In W. J. Popham (Ed.), Evaluation in education: Current applications. Berkeley: McCutchan.

Naslund, R. A., Thorpe, L. P., & Lefever, D. W. (1978). SRA achievement series: Reading, mathematics, and language arts. Chicago: Science Research Associates.

Popham, W. J. (1980). Domain specification strategies. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art. Baltimore: The Johns Hopkins University Press.

Popham, W. J., & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1-9.

Salvia, J., & Ysseldyke, J. E. (1981). Assessment in special and remedial education (2nd ed.). Boston: Houghton-Mifflin.

Skager, R. (1971). The system for objectives-based evaluation-reading. Evaluation Comment, 3, 6-11.

Subkoviak, M. J. (1975). Estimating reliability from a single administration of a mastery test. Madison, WI: Laboratory of Experimental Design, University of Wisconsin.

Tindal, G., Fuchs, L. S., Fuchs, D., Shinn, M. R., Deno, S. L., & Germann, G. (1983). The technical adequacy of a basal reading mastery test: The Scott, Foresman series (Research Report No. 128). Minneapolis: University of Minnesota, Institute for Research on Learning Disabilities. (ERIC Document Reproduction Service No. ED 236 199)

Tindal, G., Shinn, M. R., Fuchs, L. S., Fuchs, D., Deno, S. L., & Germann, G. (1983). The technical adequacy of a basal series mastery test (Research Report No. 113). Minneapolis: University of Minnesota, Institute for Research on Learning Disabilities. (ERIC Document Reproduction Service No. ED 236 191)

Yalow, E. S., & Popham, W. J. (1983). Content validity at the crossroads. Educational Researcher, 12(8), 10-14.