School Psychology Review, 2004, Volume 33, No. 2, pp. 204-217
An Examination of Variability as a Function of Passage Variance in CBM Progress Monitoring

John M. Hintze
University of Massachusetts at Amherst

Theodore J. Christ
University of Southern Mississippi

Abstract. This study examined the effects of controlling the level of difficulty on the sensitivity of repeated curriculum-based measurement (CBM). Participants included 99 students in Grades 2 through 5 who were administered CBM reading passage probes twice weekly over an 11-week period. Two sets of CBM reading progress monitoring materials were compared: (a) grade-level material that was controlled for difficulty, and (b) uncontrolled, randomly selected material from graded readers. Students' rate of progress in each progress monitoring series was summarized for slope, standard error of estimate, and standard error of slope. Results suggested that controlled reading passages significantly reduced measurement error as compared to uncontrolled reading passages, leading to increased sensitivity and reliability of measurement.

Author Note. Sincere appreciation is extended to Beth Denoncourt, Ana Genao, Keegan Manchester, and Amanda Ryan for their assistance in data collection and preparation. Correspondence concerning this article should be addressed to John M. Hintze, Ph.D., University of Massachusetts at Amherst, School Psychology Program, 362 Hills South, Amherst, MA 01003; E-mail: [email protected]
Curriculum-based measurement (CBM) is a set of standardized and specific measurement procedures that can be used to index student performance in the basic academic skill areas of reading, spelling, written expression, and mathematics calculation (Deno, 1985; Deno & Fuchs, 1987; Fuchs & Deno, 1991; Shinn, 1989). As a variant of curriculum-based assessment (CBA), CBM uses dynamic indicators in the basic skill areas for making educational decisions such as screening, instructional planning, and program evaluation (Shinn & Bamonto, 1998). When used within a problem-solving model (Deno, 2002), the primary purposes of CBM are to: (a) obtain point estimates of basic skill performance to identify and certify potential academic weaknesses, and (b)
monitor student responsiveness to instruction over time in a formative manner. When used to index student progress in a formative manner, CBM has been shown to be highly sensitive to student change over time (Fuchs, 1986, 1989, 1993; Fuchs & Fuchs, 1986a, 1986b). In addition to being sensitive to the effects of instruction, however, CBM has also been shown to be influenced by variables other than instruction. For example, the basic CBM datum in reading (i.e., oral reading fluency or rate) can be affected by variables such as who administers the reading passages and where the reading passages are administered (Derr & Shapiro, 1989; Derr-Minneci & Shapiro, 1992) and the level in the curriculum used for probe development (Dunn & Eckert, 2002; Hintze, Daly, & Shapiro, 1998).
When used in a time-series manner, formative decision making and evaluation may be affected by how many data points are available for inspection or the context in which students are being evaluated (e.g., is progress being judged within the individual or according to some group standard?) (Shinn, Powell-Smith, & Good, 1996); the nature of the curriculum used for assessment (Hintze & Shapiro, 1997; Hintze, Shapiro, & Lutz, 1994); or by the number of data points used for determining slope (Good & Shinn, 1990; Hintze, Owen, Shapiro, & Daly, 2000; Shinn, Good, & Stein, 1989).

In addition to these environmental and decision-making variables, one of the key considerations when adopting CBM for progress monitoring is the manner in which the actual probes are developed. In reading, for example, it has been demonstrated that the material from which reading passage probes are selected and the difficulty of the selected material can have significant effects on resultant CBM outcomes. Hintze et al. (1994) found that the type of curriculum used in the sampling process could significantly alter the type of growth that might be observed over time for a student. Specifically, reading curricula that were characterized by uncontrolled readability and vocabulary proved too difficult for students and thus were insensitive to growth over time. In a follow-up study, Hintze and Shapiro (1997) found that by purposively selecting text and controlling for readability and vocabulary content from otherwise uncontrolled material, sensitive progress monitoring growth information could be obtained that mirrored the type of growth that would be seen in controlled text. Together, results from these two studies suggested that when selecting reading material to be used in CBM reading progress monitoring, assessors should pay particular attention to the difficulty of the chosen text, making sure that the readability and vocabulary are appropriate for a given grade. As such, rather than simply randomly selecting text passages across the curriculum without regard for difficulty, assessors should purposively select text with an eye toward control of suitable grade-appropriate readability and vocabulary content.
Doing so helps to ensure that slope estimates are sensitive to instruction and growth over time.

Although the purposive selection of reading text is now a more common practice (Fuchs & Deno, 1994; Fuchs & Fuchs, 1999), one of the issues that has not been given full attention is the effect of material selection on the amount of measurement error that is observed over time when progress monitoring (Christ, 2002a). Some studies have found low amounts of measurement error relative to slope over a progress monitoring period of 1 to 2 years (Deno, Fuchs, Marston, & Shin, 2001; Fuchs, Fuchs, Hamlett, Walz, & Germann, 1993), whereas other studies conducted over a shorter period of time (e.g., 10-12 weeks) have found significant amounts of measurement error relative to slope or growth data (Hintze et al., 1998; Hintze & Shapiro, 1997). Not surprisingly, because slope and error estimates are calculated over calendar days, longer measurement periods increase the reliability of slope estimates and reduce measurement error, much as increasing the number of items on a test increases reliability and reduces error. Conversely, shorter progress-monitoring periods lead to less stable growth estimates and increased measurement error (Christ, 2002b). The juxtaposition of these findings is significant given that, in practice, typical student progress monitoring occurs over relatively short time periods rather than over the course of school years. More importantly, however, large amounts of measurement error relative to slope can affect the precision of educational decision making (Hintze et al., 2000). With error estimates that are as large as or greater than slope estimates, practitioners have no way of knowing whether decisions are more a product of the actual construct of interest (i.e., slope) or of measurement error brought about by brief progress monitoring periods or variations in the material used for assessment. Given the impracticality of monitoring progress over an extended period of time to reduce measurement error, one alternative would be to further alter the manner in which reading assessment material is selected, in hopes of reducing measurement error over shorter time periods.
In particular, it would seem that by carefully controlling the difficulty of selected reading text, reductions in measurement error of both point estimates of the CBM reading datum (i.e., standard error of estimate) and growth estimates (i.e., standard error of slope) would be observed.

The purpose of this study was to build upon previous literature that has examined the effects of readability on CBM outcomes by comparing the amount of measurement error present when purposively controlling the difficulty of reading passage material as compared to the random selection of assessment passages as originally proposed by the developers of CBM. The study is unique in that it is the first to examine explicitly the amount of error that is contributed to CBM slope estimates as a function of surface-level features of written text. It was hypothesized that by carefully controlling the difficulty of text, significant reductions in measurement error would be observed as compared to randomly selected material.

Method

Participants and Setting

A total of 99 students enrolled in eight second- through fifth-grade classrooms from a middle- to lower-middle-class elementary school (26% free or reduced lunch) located in the Northeast served as participants in the study. An a priori power analysis (Cohen, 1988, 1992; Hintze, 2000) suggested that a sample size of 25 students per grade would provide adequate power (.80) for main and interaction effects, assuming a medium effect size (.25) and an alpha level of .05. Students with limited English proficiency, as assessed by a certified English as a Second Language (ESL) teacher, were not considered for participation. From all remaining students, parent permission for participation was requested. The final sample of 99 students (Grade 2, n = 28; Grade 3, n = 23; Grade 4, n = 26; Grade 5, n = 22) consisted of 92 Caucasian, 2 African American, 3 Hispanic, and 2 Asian students. Educationally, all but one of the participating students were reading at an average to above-average level, as judged by initial CBM reading performance, and received their reading instruction within the general education classroom with no supplementary assistance.
The remaining student received remedial reading services outside the classroom in addition to his or her reading instruction in the general education classroom.

Materials

CBM probes. Using the methods outlined by Shinn (1988, 1989) and Hintze et al. (1998), representative reading samples of controlled and uncontrolled grade-level material were specified in Grades 2 through 5. Uncontrolled reading passages were represented by material in which the students were currently being instructed (Houghton Mifflin, 1999). Comparatively, controlled reading passages were represented by grade-appropriate material not including the Houghton Mifflin (1999) series. Controlled reading passages were sampled from a variety of literature-based and skills-based reading series. These included Harcourt-Brace; Silver, Burdett, and Ginn; Scott-Foresman; Macmillan Press; and previous versions of the Houghton Mifflin series. Copyright and publication dates of the controlled reading passages ranged from 1979 to 2000. To ensure a representative and heterogeneous sample, each grade-level controlled passage series was composed of a minimum of two passages from each of the five reading series.

For each series (i.e., controlled and uncontrolled), 20 reading passage probes were constructed per grade level. No student read from the same reading probe twice during the study. For the uncontrolled set, passages were randomly selected using a random number generator, which indicated the page from which the passage was selected. Potential passage probes were sampled from narrative text only (expository text was not used). In addition, passages written in a poetic or dramatic fashion (e.g., songs, plays) were excluded. Each passage was retyped to minimize the effects of pictures. For measurement purposes, a second copy with a cumulative running word count printed in the right-hand margin was provided to the examiner for each student.
For each probe, a readability score was calculated using the Spache (1953) formula for passages sampled from Grades 2 and 3, and the Dale-Chall formula (Dale & Chall, 1948) for passages sampled from Grades 4 and 5. The Spache and Dale-Chall formulas were chosen (a) due to their documented reliability in the lower and upper elementary grades, respectively, and (b) because each uses greater than 3,000 age-appropriate words for evaluating text, in comparison to other readability formulas that simply consider the number of letters per word and words per sentence in the absence of any comprehensible text (Klare, 1984).

The controlled passages were developed in the same manner as the uncontrolled passages with the following exceptions: (a) passages were purposively rather than randomly sampled across the five reading curricula, so that (b) each probe in a given grade level demonstrated a readability score within the middle 5 months of that academic year. For example, all probes from the second-grade controlled series were required to demonstrate readability scores in the range of 2.3 to 2.7; probes from the third-grade series in the range of 3.3 to 3.7; and so on. Table 1 provides the overall means and standard deviations for readability for the controlled and uncontrolled probe series. Pair-wise t-tests were used to test for significant differences between the two probe series within each grade. This was done by pairing Controlled Passage #1 to Uncontrolled Passage #1, Controlled Passage #2 to Uncontrolled Passage #2, and so on. A Bonferroni adjustment was used to set alpha [t(19) = 2.86, α = .01, FW = .10] to minimize familywise error. Results indicated that the probe series differed significantly in Grade 4, with the readability estimates for the uncontrolled series (M = 6.04) being significantly greater than the readability estimates for the controlled series (M = 4.57).
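To make the passage-screening rule concrete, the sketch below (Python) filters hypothetical candidate probes to the middle 5 months of the target grade. The passage identifiers and readability values are made up, and a simple grade-band check stands in for the Spache and Dale-Chall formulas themselves.

```python
# Sketch of the passage-screening rule described above; all values are hypothetical.

def in_controlled_band(readability: float, grade: int) -> bool:
    """Keep a passage only if its readability falls in the middle five
    months of the target grade, e.g., 2.3-2.7 for Grade 2."""
    return grade + 0.3 <= readability <= grade + 0.7

# Hypothetical candidate probes: (passage_id, grade, readability estimate)
candidates = [
    ("A-01", 2, 2.5),   # kept: within 2.3-2.7
    ("A-02", 2, 3.1),   # rejected: too difficult for the Grade 2 controlled series
    ("B-07", 4, 4.6),   # kept: within 4.3-4.7
    ("B-11", 4, 6.0),   # rejected: roughly the difficulty the study found in
                        #   the uncontrolled Grade 4 material
]

controlled_series = [c for c in candidates if in_controlled_band(c[2], c[1])]
print(controlled_series)   # [('A-01', 2, 2.5), ('B-07', 4, 4.6)]
```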
Dependent Measures

Oral reading rate. The number of words read correctly per minute served as the datum for each individual probe session. Using scoring as outlined by Shinn (1989), words read correctly were those pronounced correctly, given the reading context. For example, the word read must have been pronounced "reed" when presented in the context of "He will read the book," not as "red." Repetitions and self-corrections were counted as correctly read words. In addition, the following types of errors were counted as words read incorrectly: (a) mispronunciations (words that were misread: dog for dig), (b) substitutions (words substituted for the stimulus word; this was often inferred by a one-to-one correspondence between word orders: dog for cat), and (c) omissions (words skipped or not read). Finally, if a student struggled to pronounce a word or hesitated for 3 seconds, he or she was provided the word by the examiner and an error was recorded.

Calculation of growth and error. Dependent variables were characterized by an index of slope and its associated standard error of estimate and standard error of slope. Specifically, the standard error of estimate refers to the amount of variability that could be expected for any single datum point collected at any one point in time. For example, if a student's observed oral reading rate during any one probe session was 90 words read correctly per minute, his or her true score might be anywhere from 75 to 105 words read correctly per minute once error in measurement is considered. Here, the standard error is used to provide confidence intervals within which a person's true score is likely to fall, in the same manner in which it is used with other standardized assessments. Standard error of slope, on the other hand, is the amount of change in the "line of best fit" that is drawn through a series of data points as a result of taking the standard error into consideration. Building on the example above, if a student were assessed repeatedly over time on 20 occasions, each single datum point theoretically varies around some observed value due to standard error. If this variation is taken into account for each datum point and "lines of best fit" are computed for all possible permutations of the 20 data points, the resultant variation in the "lines of best fit" would represent the standard error of slope. Simply, the standard error of slope is the change in the "line of best fit" that may be expected as a function of the standard error of estimate.
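The quantities just described can be illustrated with a small ordinary least-squares sketch. The scores below are hypothetical, and the formulas are the standard OLS expressions for the line of best fit, the standard error of estimate, and the standard error of slope, with the daily slope converted to a weekly figure by multiplying by 7 as the study does; this is an illustration of the concepts, not code used in the study.

```python
import numpy as np

# Hypothetical progress-monitoring series: (calendar day, words read correctly).
days = np.array([0, 3, 7, 10, 14, 17, 21, 24, 28, 31], dtype=float)
wrc  = np.array([62, 65, 61, 70, 68, 74, 71, 78, 80, 77], dtype=float)

n = len(days)
slope, intercept = np.polyfit(days, wrc, 1)      # OLS line of best fit
predicted = intercept + slope * days
residuals = wrc - predicted

# Standard error of estimate: typical distance of an observed score from the
# trend line (n - 2 degrees of freedom because a two-parameter line is fitted).
see = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error of slope: SE(b) = SEE / sqrt(sum((x - mean(x))^2)); it shrinks
# as SEE shrinks or as the monitoring window lengthens.
se_b = see / np.sqrt(np.sum((days - days.mean()) ** 2))

print(f"slope = {slope:.3f} words/day ({7 * slope:.2f} words/week)")
print(f"SEE   = {see:.2f} words read correctly")
print(f"SE(b) = {7 * se_b:.3f} per week")

# A single day's observation can be bracketed with the SEE, in the spirit of the
# 90 +/- 15 example above (a rough interval, not an exact probability statement).
print(f"day-14 observation 68, trend interval roughly "
      f"{predicted[4] - 2 * see:.0f} to {predicted[4] + 2 * see:.0f}")
```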
Table 1
Means and Standard Deviations for Readability by Probe Series and Grade

Probe Series    Grade 2       Grade 3       Grade 4       Grade 5
Uncontrolled    2.88 (.81)    3.19 (.66)    6.04 (.82)    6.17 (1.30)
Controlled      2.49 (.15)    3.50 (.15)    4.57 (.13)    5.51 (.15)
Using standard CBM procedures, ordinary least-squares regression (Good & Shinn, 1990; Shinn et al., 1989) was applied for each student between oral reading rate and calendar day. Thus, each student had two sets of growth and error estimates: one that represented growth and variability in the uncontrolled set of progress-monitoring materials and one that corresponded to growth and variability in the controlled set of progress-monitoring materials. In keeping with previous CBM progress-monitoring research, calendar-day growth and variability estimates were converted to weekly estimates by multiplying by 7 calendar days per week. Because the primary analyses involved group comparisons, individual student performance was combined and averaged by grade and probe type.

Interscorer Agreement

Interscorer agreement data were collected on 20% of all possible probes administered during the study. An equivalent percentage of probes from each data cell was subjected to agreement checks. Interscorer agreement was determined on a point-by-point basis by having an independent scorer, naïve to the purposes of the study, record each word read as either correct or incorrect while listening to an audiotaped account of the student's reading. In cases where discrepancies occurred, the scoring of the primary data collector was used as the summary datum. Total percentage agreement was then calculated for each probe by dividing the number of agreements by the number of agreements plus disagreements and multiplying by 100 (House, House, & Campbell, 1981). Agreement scores ranged from 96% to 100%, with a mean agreement of 99%.
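As a minimal arithmetic illustration of the point-by-point agreement index, with made-up counts rather than data from the study:

```python
# Point-by-point interscorer agreement (House, House, & Campbell, 1981).
# The counts below are hypothetical.
agreements, disagreements = 148, 2
percent_agreement = agreements / (agreements + disagreements) * 100
print(f"{percent_agreement:.1f}% agreement")   # 98.7% agreement
```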
Procedural Integrity

Procedural integrity was checked across all 11 weeks of the study. Using a checklist, the second author noted whether each data collector followed procedures on a step-by-step basis. Procedures included whether all materials were present, time was kept accurately, instructions were given as specified, and students correctly followed the procedures. All integrity scores were 100%.

Procedures

Training procedures. Four graduate students enrolled in a school psychology program were trained to administer and score the reading probes and to serve as data collectors. Each was provided direct instruction and a written set of instructions detailing how the reading probes were to be administered and scored. Following direct instruction and training, each data collector practiced administering reading probes with the other data collectors and the trainer. Training was facilitated by using audiotaped passages read by students from a previous research endeavor. Each data collector listened to and scored a minimum of 10 probes. Scores of at least 95% agreement, with three consecutive passages of 100% agreement, were required for each data collector prior to the initiation of the study.

CBM progress-monitoring procedures. Progress-monitoring sessions were conducted twice a week during an 11-week period. Due to a school break and statewide testing, it was possible to conduct 19 of the 20 probe sessions. As such, each student had a maximum of 19 sessions. Missed probe sessions were not rescheduled. All participants attended at least 16 of the 19 administration sessions that occurred.
Mean session participation for students in the second through fifth grades was 96%, 94%, 92%, and 93%, respectively.

At each progress monitoring session, students were provided with two reading passages, one each from the uncontrolled and controlled probe series corresponding to the student's grade. Passage order within probe type (i.e., uncontrolled and controlled) was randomized at the outset of the study to limit any systematic bias that might occur by presenting passages in the order in which they appeared within a specific text. In addition, order of presentation of the reading passages was counterbalanced across sessions within probe type for each student. Before each probe session, the data collector told the student to read aloud and to try to do his or her best reading. The data collector then gave the student a copy of the first passage, made sure the stopwatch was ready, and instructed the student to begin reading at the top of the page. The data collector marked errors on the corresponding scoring sheet. Separate scoring sheets were used for each student. At the end of 1 minute, the data collector stopped the student. If the student was in the middle of a sentence, he or she was allowed to finish; however, the data collector marked where the student was at the end of the minute by placing a bracket after the last word read. The data collector then computed the number of words read correctly during the minute. The number of words read correctly per minute on each passage served as the summary datum for that probe session.

Results

Descriptive Statistics and Preliminary Data Analysis

Table 2 provides the average slopes (b), standard errors of slope (SE(b)), and standard errors of estimate (SEE) by grade and probe type. Descriptive statistics suggested that the distributions of the dependent variables met the assumptions of multivariate normality and were suitable for a doubly multivariate analysis of variance (MANOVA).¹ In addition, prior to any primary analyses, the data were analyzed to determine whether the nested effects of classrooms within grades or sequence effects (i.e., the order of the counterbalanced administration) might have any effect on subsequent findings.
To do so, a doubly multivariate analysis of variance was conducted using the three nested dependent variables (i.e., b, SE(b), and SEE, nested within probe type: uncontrolled and controlled) as the within-subject factor, and grade, class, and order as between-subject variables (Christ, 2002a). Using Wilks' criterion, the class × order [F(3, 183) = .405, p = .75], grade × class [F(9, 446) = 1.00, p = .44], and grade × order [F(9, 446) = .862, p = .56] interactions were not significant.² Moreover, as expected, the within-subject factors of order [F(3, 183) = .465, p = .710] and class [F(3, 183) = .844, p = .463] were nonsignificant, and the main effect for grade was significant [F(9, 445) = 3.86, p < .001; η² = .06]. Results of these preliminary analyses suggest that order and class did not have a significant influence on the outcomes within or between grades. Consistent with these results, class and order were collapsed within grade for the primary analyses.

Primary Data Analysis

To assess the effects of grade, probe type, and their interaction, a 2 (CBM probe type) × 4 (grade) doubly multivariate analysis of variance with repeated measures across the first factor (i.e., b, SE(b), and SEE) was conducted. Using Wilks' criterion, the grade by probe type interaction was significant [F(9, 285) = 2.86, p = .003; η² = .08]. Follow-up Roy-Bargmann step-down analysis suggested that the interaction was attributable to the effects of slope (b) [F(3, 93) = 4.88, p = .003].³ Important to note, however, is that this analysis had an observed power of .90 and an associated effect size of η² = .08. Thus, the analysis was quite sensitive to small effects and was most likely indicative of minor group differences of no real practical consequence. The significance of the interaction is likely attributable to the differences in readability between the probe series in the fourth-grade condition. The fourth-grade condition was the only grade to demonstrate significant differences in readability estimates, and it was also the only condition where the slopes were significantly different.
Table 2
Growth and Error Estimates by Grade and Probe Series

Grade  Probe Type    Statistic  M      (SD)    Skewness  Kurtosis  Homogeneityᵃ
2      Uncontrolled  Slope (b)  .25    (1.61)  -.90      -.78      3.14
                     SE(b)      1.13   (.17)   1.02      -.12      2.21
                     SEE        14.36  (2.46)  .24       .16       2.25
       Controlled    Slope (b)  -.05   (1.07)  -1.07     -.60      1.43
                     SE(b)      1.00   (.14)   .43       -.91      2.96
                     SEE        12.67  (1.98)  .08       -.82      2.24
3      Uncontrolled  Slope (b)  .50    (.91)   -.52      -.59      3.14
                     SE(b)      1.28   (.28)   .65       -.42      2.21
                     SEE        15.96  (3.71)  .91       .01       2.25
       Controlled    Slope (b)  .48    (1.19)  .62       -.81      1.43
                     SE(b)      1.12   (.24)   .41       -1.04     2.96
                     SEE        13.99  (2.84)  1.00      -.41      2.24
4      Uncontrolled  Slope (b)  -.06   (1.30)  -.48      -.27      3.14
                     SE(b)      1.22   (.26)   1.05      .28       2.21
                     SEE        15.96  (3.06)  1.41      .34       2.25
       Controlled    Slope (b)  .24    (1.00)  1.80      -.33      1.43
                     SE(b)      .94    (.20)   .43       -.44      2.96
                     SEE        11.82  (2.54)  .79       -.17      2.24
5      Uncontrolled  Slope (b)  .42    (.93)   .33       -.06      3.14
                     SE(b)      1.46   (.26)   -.23      -.11      2.21
                     SEE        18.10  (3.08)  .33       -.06      3.14
       Controlled    Slope (b)  1.02   (1.02)  -.18      -.97      1.43
                     SE(b)      1.22   (.26)   .23       .50       2.96
                     SEE        15.01  (2.96)  -.40      -.22      2.24

Note. ᵃHomogeneity of variance ratios.
Results of the tests of main effects indicated significant differences in the multivariate combination of the dependent variables for both grade [F(9, 285) = 4.36, p = .000; η² = .13] and probe type [F(9, 93) = 24.27, p = .000; η² = .44].⁴ Follow-up Roy-Bargmann step-down analysis of probe type suggested that the controlled reading passage probe series evidenced significantly smaller estimates of both SE(b) [F(1, 95) = 56.46, p = .000] (see Table 3) and SEE [F(1, 94) = 9.49, p = .003], with associated effect sizes (Cohen's d) of 1.54 and 1.86, respectively. Follow-up Roy-Bargmann step-down analysis of grade indicated significant differences between grades for SE(b) [F(3, 95) = 11.35, p = .000] (see Table 4) but not for SEE or slope. This finding suggests that the reliability of slope estimates varied by grade. This, however, is not surprising given that grade-level conditions differed in student and probe sample characteristics. What is more interesting and significant is that the controlled reading passage probe series consistently resulted in more stable (and thus more reliable, given the inverse relationship between measurement error and reliability) slope estimates as compared to the uncontrolled probe series. Finally, no significant differences were observed for slope [F(1, 93) = 1.63, p = .206]. Overall, the results of these analyses indicate that the controlled reading passage probe series resulted in significantly reduced measurement error as compared to the uncontrolled reading passage probe series, thus improving reliability.

Discussion

The purpose of this study was to compare the amount of measurement error present when purposively controlling the difficulty of reading passage material as compared to the random selection of assessment passages, as has been suggested by previous research in CBM progress monitoring. Results indicated that significant reductions in measurement error were observed when CBM reading materials were constructed in a manner that considered and controlled the difficulty level of the text presented to a student.
That is, by controlling the level of difficulty of the reading text, assessors can minimize the effects of passage or text difficulty as a potential source of measurement error. At the individual passage level, the reduction of measurement error brought about by carefully controlling difficulty results in a smaller standard error of estimate as compared to reading passages for which difficulty is uncontrolled. When multiple data points are collected over time, as is done in CBM progress monitoring, the reduction in standard error of estimate results in a reduced standard error of slope, due to the symbiotic nature of the two constructs. Results of the current study suggest that assessors can improve both the point estimates of oral reading behavior (i.e., any one datum point collected on a single day) and the stability of data over time (i.e., reductions in variability of performance over time), and they provide support for the hypothesized differences in variability as a function of difficulty, as predicted based on the work of Fuchs et al. (1993), Deno et al. (2001), and Hintze and colleagues (Hintze et al., 1994; Hintze & Shapiro, 1997). More importantly, however, it stands to reason that reductions in measurement error could potentially lead to enhanced data-based decision making. Data-based decisions can only be as good as the reliability of the information from which they are drawn. If the reliability of the information (in this case, CBM data) is improved, the potential for improved decision making clearly follows.

Implications for Progress Monitoring

An examination of the data reveals a number of relevant points. Importantly, the use of probes derived from purposively controlled reading text reduced the amount of measurement error relative to uncontrolled, randomly selected reading passages. The reduction in measurement error came about primarily as a result of the reduction in the SEE. Briefly, "error" in CBM progress monitoring is the difference between observed and predicted performance. Here, observed data are the data that are actually collected using standard CBM procedures, and predicted data are the values that would be expected at any one point in time (e.g., school day) on the basis of previous performance.
Table 3
Means and Standard Deviations for SEE and SE(b) by Probe Type Collapsed Across Grade

Probe Type      SE(b)        SEE
Uncontrolled    1.27 (.14)   16.10 (1.53)
Controlled      1.07 (.12)   13.37 (1.41)
Moreover, samples of observed data are used to infer what a student's "true" abilities are with respect to any construct of interest. Lower levels of measurement error in observed data result in higher levels of confidence that the "observed" performance represents "true" performance or ability. In the case of CBM, for example, if an assessor has been collecting CBM reading data on a student for 4 weeks and is asked to predict or characterize the student's performance at some time in the future, a "best guess" estimate would likely be made by examining the previous trend of the collected data and using that trend to predict future performance. Using past performance in an ongoing manner to predict future performance is precisely what school psychologists are now required to do in keeping with IDEA '97. A problem occurs, however, when large amounts of measurement error are found in observed data. When this is the case, a student's "true" abilities with respect to oral reading could well be much higher or lower than what was witnessed on the CBM reading probe. Although no one would advocate characterizing a student's performance on the basis of one reading sample, if multiple samples collected over time carry similar measurement error, the problem is compounded. Ultimately, when making an educational decision, one cannot tell whether the basis for the decision is a student's "true" reading performance over time or data that are marred by error (and thus not representative of the student's "true" reading abilities). Simply, the confidence and precision of educational decisions are affected by the amount of measurement error contained in the data used to make the decision.
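The role that measurement error plays in such a forecast can be sketched as follows; the data, the four-week horizon, and the rough ±2 SEE band are illustrative assumptions rather than a procedure prescribed by the study.

```python
import numpy as np

# Four weeks of hypothetical twice-weekly CBM data: (calendar day, words correct).
days = np.array([0, 3, 7, 10, 14, 17, 21, 24], dtype=float)
wrc  = np.array([58, 61, 57, 64, 63, 66, 62, 69], dtype=float)

slope, intercept = np.polyfit(days, wrc, 1)
residuals = wrc - (intercept + slope * days)
see = np.sqrt(np.sum(residuals ** 2) / (len(days) - 2))

# "Best guess" for performance four weeks (28 days) past the last data point,
# bracketed with a rough +/- 2 SEE band.  The larger the SEE, the less useful
# the forecast is for judging whether instruction is working.
future_day = days[-1] + 28
forecast = intercept + slope * future_day
print(f"forecast at day {future_day:.0f}: {forecast:.0f} words correct "
      f"(roughly {forecast - 2 * see:.0f} to {forecast + 2 * see:.0f})")
```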
By reducing the amount of measurement error in CBM reading progress monitoring, practitioners can be more confident that the observed data that are collected accurately characterize a student's performance at one point in time. In addition, when multiple samples of reading behavior are collected over time, as is the case in CBM progress monitoring, confidence is increased in using previously collected data to predict future performance, as in the case of formative evaluation.

Limitations/Strengths and Implications for Future Research

Although the current study has attempted to shed new light on the relationship between the difficulty of assessment material and CBM reading progress monitoring, the results must nonetheless be interpreted in light of a number of limitations that might affect the generalized causal inference of the findings (Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002). First, there is the possibility that the passages differed from each other in ways other than difficulty level. In particular, because one set of passages was selected from material in which the students were being instructed and the other set from a variety of reading curricula, differences in variability might be due to familiarity rather than difficulty. This may be further compounded by the fact that, in its typical use for progress monitoring, CBM generally takes a "long-term goal" approach to selecting individual progress-monitoring material, in which challenging rather than grade-appropriate material is used for assessment.
Table 4
Means and Standard Deviations for SEE and SE(b) by Grade Collapsed Across Probe Type

Grade    SE(b)        SEE
2        1.06 (.09)   13.52 (1.19)
3        1.20 (.11)   14.98 (1.39)
4        1.08 (.19)   13.89 (2.93)
5        1.34 (.17)   16.56 (2.18)
Individualizing such progress-monitoring material in group studies presents a challenge, as varying material on an individual basis would confound group results. The use of grade-level material may have had some effect on the observed results, but it would seem that familiarity would lead to reductions in variability rather than to the increases in variability observed in the current study. Nonetheless, the source from which the passages were drawn is an important consideration. Future studies should consider selecting all measurement material from the same sampling pool (i.e., material currently used for instruction versus material not used for instruction) to keep familiarity consistent.

Second, questions regarding the temporal order of the passages, participant selection, and the interaction of these threats may have some effect on the internal validity of the study. Specifically, in the case of the uncontrolled passages, because the difficulty of the reading material was not controlled, there is no way of knowing how the order of presentation of the passages affected the slope outcomes. This could have been particularly problematic if an unusually easy or difficult passage was administered at the beginning or at the end of the time-series data run. Ordinary least-squares regression techniques are sensitive to outliers in the analysis, particularly to the temporal position of the outliers. An unusually easy or difficult reading passage at the beginning or end of the data run would have undue influence on the slope outcomes by "pulling" the trend line, or line of best fit, in the direction of the outlier.
If this were the case, the observed slope estimates would not be an accurate reflection of true growth over time. Although this might be perceived as a limitation from a statistical summary perspective, the procedures used in the uncontrolled condition were characteristic of those that are typically recommended. A similar concern would not appear evident in the controlled probe condition, because unusually easy or difficult passages were systematically screened out of the reading passage series.

Third, in addition to material selection, participant selection may also have had some effect on the observed outcomes. The concern here is that all students in the study received their primary reading instruction in the general education classroom and none in special education. Indeed, an examination of reading rates by grade indicated that, on average, students were reading above commonly accepted oral reading rate performance standards (Shapiro, 1996). As such, it would appear that participant selection might have been biased in the direction of better-than-average readers. If this were the case, the material presented to the students might have been in their instructional or mastery range, rather than the long-term goal-level material that is typically specified when using CBM. This would also have an effect on the observed slope estimates, as students might have read at one consistent level, given that the material might not have provided room for growth as students' skills improved. Logistically, overcoming this problem would require different sets of reading passages for each student within grade.
Doing so, however, would work against any type of group comparison or comparison of variability at the group level. Future studies might examine this type of variation using single-case experimental designs, which would allow for an examination of individual variability over time.

From an external validity perspective, questions regarding the interaction of causal relationships with treatment variations and settings might be of some concern. In particular, it is not known whether similar results would be observed using different reading curricula in both the uncontrolled and controlled passage conditions. This is probably less of a concern in the controlled condition, as passages were selected across a variety of curricula; however, whether similar results would be evidenced in uncontrolled curricula other than the Houghton Mifflin (1999) series is unknown. Different results might be observed in curricula that make more of an attempt to purposively control the difficulty of the reading material. If difficulty were purposively controlled from the outset, the differences in variability might diminish as a function of this control. If this were the case, carefully controlling the difficulty of the CBM progress-monitoring passages would not be needed. Also, whether similar results would be observed in settings other than the current one (e.g., other schools or geographic regions) is unknown. Replication and extension of the current study would help answer these questions.

Finally, on a positive note, it would appear that the current study provided adequate control of threats to statistical conclusion and construct validity. In particular, it would appear that the study had adequate statistical power at the outset to correctly identify relationships among variables and did not violate assumptions of the statistical analyses, suffer from error rate problems, or suffer from unreliability of measures or treatment implementation. Moreover, construct validity was carefully controlled at the outset of the study through the selection of the passages, and data collectors as well as participants were kept blind to the purposes of the study. These strengths, combined with minimal threats, would suggest some allowance for generalized causal inference.
Conclusions

The purpose of this study was to compare the amount of measurement error present when purposively controlling the difficulty of reading passage material as compared to the random selection of assessment passages, as has been suggested by previous research in CBM progress monitoring. Based on the current findings, it appears that we can expect differential levels of measurement error as a function of the degree of difficulty of the reading passages used for progress monitoring. Consequently, teachers, school psychologists, and researchers may want to reconsider their current progress monitoring procedures, particularly as they relate to characteristically uncontrolled reading material. Although CBM in reading continues to be a robust approach to instructionally relevant measurement, the results of this study should help practitioners and researchers enhance the fidelity of their assessments as well as improve formative decision making.

Footnotes

¹ Data also met the necessary assumptions for the Roy-Bargmann step-down analysis (used in place of ANOVA). The collapsed sample yielded consistent cell sizes with case-to-dependent-variable ratios greater than 3:1 for all groups. Because the dependent measurement variables were noncommensurate, Box's M rejected homogeneity of the variance-covariance matrices [F(63, 19806) = 3.92, p < .001]. This finding was a result of the noncommensurate dependent measurement variables (i.e., b, SE(b), and SEE). Hartley's test, a more appropriate measure of homogeneity under these conditions, did support the assumption of homogeneous variance for each dependent measurement variable across group conditions [Fmax(4, 98) = 3.13, p > .05].

² Higher order profiles (i.e., grade × order × class) were not tested because n = 0 for some conditions.

³ Prior to the primary analyses, the dependent variables were evaluated for dependency. As expected, results revealed a strong positive relationship between the two error components (i.e., SE(b) and SEE; r = .93). Strong association between
dependent variables violates the assumption of independence in ANOVA for tests of univariate effects. In cases such as these, the Roy-Bargmann step-down procedure is the recommended post-hoc alternative. In this procedure, a one-way ANOVA is used to test the highest priority dependent variable (in this case, SE(b)). Each successive dependent measure is then tested in a series of ANCOVAs using all higher priority dependent variables as covariates. A Bonferroni adjustment was used to control familywise error (FW = .10; α = .01).

⁴ Profile analysis of repeated measures allows for the interpretation of main effects in the presence of interactions without violating any assumptions of statistical conclusion validity (Tabachnick & Fidell, 2001). As such, tests of main effects were conducted and interpreted as appropriate.
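A rough sketch of the step-down sequence described in Footnote 3, assuming a hypothetical per-student data frame and using the statsmodels library; the simulated values, the variable names, and the use of grade as the grouping factor are illustrative only, not a reproduction of the study's analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical per-student summary data for one probe series.
rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "grade": rng.choice([2, 3, 4, 5], size=n),
    "slope": rng.normal(0.3, 1.0, size=n),
    "se_b": rng.normal(1.1, 0.2, size=n),
    "see": rng.normal(14.0, 2.5, size=n),
})

# Step 1: highest-priority DV, SE(b), tested with a one-way ANOVA on grade.
step1 = smf.ols("se_b ~ C(grade)", data=df).fit()
print(sm.stats.anova_lm(step1, typ=2))

# Step 2: SEE tested with grade, using SE(b) as a covariate (ANCOVA).
step2 = smf.ols("see ~ C(grade) + se_b", data=df).fit()
print(sm.stats.anova_lm(step2, typ=2))

# Step 3: slope tested with grade, using both higher-priority DVs as covariates.
step3 = smf.ols("slope ~ C(grade) + se_b + see", data=df).fit()
print(sm.stats.anova_lm(step3, typ=2))
```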
References

Christ, T. J. (2002a). The effects of passage-difficulty on CBM progress monitoring outcomes: Stability and accuracy. Unpublished doctoral dissertation, University of Massachusetts, Amherst.
Christ, T. J. (2002b). The reliability of progress monitoring growth using curriculum-based measurement techniques: Stability, accuracy, and reliability. Manuscript in preparation.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Lawrence Erlbaum Associates.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston: Houghton Mifflin.
Dale, E., & Chall, J. (1948). A formula for predicting readability. Educational Research Bulletin, 27, 11-20.
Deno, S. L. (1985). Curriculum-based measurement: The emerging alternative. Exceptional Children, 52, 219-232.
Deno, S. L. (2002). Problem solving as "Best Practice." In A. Thomas & J. Grimes (Eds.), Best practices in school psychology IV (pp. 37-56). Bethesda, MD: National Association of School Psychologists.
Deno, S. L., & Fuchs, L. S. (1987). Developing curriculum-based measurement systems for data-based special education problem solving. Focus on Exceptional Children, 19, 1-15.
Deno, S. L., Fuchs, L. S., Marston, D., & Shin, J. (2001). Using curriculum-based measurement to establish growth standards for students with learning disabilities. School Psychology Review, 30, 507-524.
Derr, T. F., & Shapiro, E. S. (1989). A behavioral evaluation of curriculum-based assessment of reading. Journal of Psychoeducational Assessment, 7, 148-160.
Derr-Minneci, T. F., & Shapiro, E. S. (1992). Validating curriculum-based measurement in reading from a behavioral perspective. School Psychology Quarterly, 7, 2-16.
Dunn, E. K., & Eckert, T. L. (2002). Curriculum-based measurement in reading: A comparison of similar versus challenging material. School Psychology Quarterly, 17, 24-46.
Fuchs, L. S. (1986). Monitoring progress among mildly handicapped pupils: Review of current practice and research. Remedial and Special Education, 7, 5-12.
Fuchs, L. S. (1989). Evaluating solutions, monitoring progress and revising intervention plans. In M. R. Shinn (Ed.), Curriculum-based measurement: Assessing special children (pp. 153-181). New York: Guilford.
Fuchs, L. S. (1993). Enhancing instructional programming and student achievement with curriculum-based measurement. In J. J. Kramer (Ed.), Curriculum-based measurement (pp. 65-103). Lincoln, NE: University of Nebraska-Lincoln, Buros Institute of Mental Measurements.
Fuchs, L. S., & Deno, S. L. (1991). Paradigmatic distinctions between instructionally relevant measurement models. Exceptional Children, 57, 488-500.
Fuchs, L. S., & Deno, S. L. (1994). Must instructionally useful performance assessment be based in the curriculum? Exceptional Children, 61, 15-24.
Fuchs, L. S., & Fuchs, D. (1986a). Curriculum-based assessment of reading progress toward long-term and short-term goals. The Journal of Special Education, 20, 69-82.
Fuchs, L. S., & Fuchs, D. (1986b). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53, 199-208.
Fuchs, L. S., & Fuchs, D. (1999). Monitoring student progress toward development of reading competence: A review of three forms of classroom-based assessment. School Psychology Review, 28, 659-671.
Fuchs, L. S., Fuchs, D., Hamlett, C. L., Walz, L., & Germann, G. (1993). Formative evaluation of academic progress: How much growth can we expect? School Psychology Review, 22, 27-48.
Good, R. H., & Shinn, M. R. (1990). Forecasting accuracy of slope estimates for reading curriculum-based measurement: Empirical evidence. Behavioral Assessment, 12, 79-193.
Hintze, J. L. (2000). PASS power analysis. Kaysville, UT: NCSS Statistical Software.
Hintze, J. M., Daly, E. J., & Shapiro, E. S. (1998). An investigation of the effects of passage difficulty level on outcomes of oral reading fluency progress monitoring. School Psychology Review, 27, 433-445.
Hintze, J. M., Owen, S. V., Shapiro, E. S., & Daly, E. J. (2000). Generalizability of oral reading fluency measures: Application of G theory to curriculum-based measurement. School Psychology Quarterly, 15, 52-68.
Hintze, J. M., & Shapiro, E. S. (1997). Curriculum-based measurement and literature-based reading: Is curriculum-based measurement meeting the needs of changing reading curricula? Journal of School Psychology, 35, 351-375.
Hintze, J. M., Shapiro, E. S., & Lutz, J. G. (1994). The effects of curriculum on the sensitivity of curriculum-based measurement in reading. The Journal of Special Education, 28, 188-202.
Houghton Mifflin. (1999). Invitation to literacy. Boston: Author.
House, A. E., House, B. G., & Campbell, M. B. (1981). Measures of interobserver agreement: Calculation formula and distribution effect. Journal of Behavioral Assessment, 3, 37-57.
Klare, G. R. (1984). Readability. In P. D. Pearson, R. Barr, M. L. Kamil, & P. Rosenthal (Eds.), Handbook of reading research (pp. 681-744). New York: Longman.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Shapiro, E. S. (1996). Academic skills problems: Direct assessment and intervention (2nd ed.). New York: Guilford Press.
Shinn, M. R. (1988). Development of curriculum-based local norms for use in special education decision-making. School Psychology Review, 17, 61-80.
Shinn, M. R. (Ed.). (1989). Curriculum-based measurement: Assessing special children. New York: Guilford Press.
Shinn, M. R., & Bamonto, S. (1998). Advanced applications of curriculum-based measurement: "Big ideas" and avoiding confusion. In M. R. Shinn (Ed.), Advanced applications of curriculum-based measurement (pp. 1-31). New York: Guilford Press.
Shinn, M. R., Good, R. H., & Stein, S. (1989). Summarizing trend in student achievement: A comparison of methods. School Psychology Review, 18, 356-370.
Shinn, M. R., Powell-Smith, K. A., & Good, R. H. (1996). Evaluating the effects of responsible reintegration into general education for students with mild disabilities on a case-by-case basis. School Psychology Review, 25, 519-539.
Spache, G. (1953). A new readability formula for primary grade materials. Elementary English, 53, 410-413.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston: Allyn & Bacon.
John M. Hintze received his PhD in School Psychology from Lehigh University in 1994 and is an Associate Professor of School Psychology at the University of Massachusetts at Amherst. His primary research interests are in academic and behavioral assessment, research design, and data analysis.
Theodore J. Christ ([email protected]) received his PhD in School Psychology from the University of Massachusetts at Amherst in 2002 and is now faculty with the School Psychology Program at The University of Southern Mississippi. He is interested in the development of assessment and intervention procedures that facilitate school-based problem solving. His research examines the psychometric properties of dynamic assessments, subskill analysis of academic tasks, and the implementation of ecologically based interventions.