Learning Disabilities Research & Practice, 24(4), 174–185. © 2009 The Division for Learning Disabilities of the Council for Exceptional Children
How Can We Improve the Accuracy of Screening Instruments?

Evelyn S. Johnson, Boise State University
Joseph R. Jenkins, University of Washington, Seattle
Yaacov Petscher, Florida Center for Reading Research, Florida State University
Hugh W. Catts, University of Kansas

Screening for early reading problems is a critical step in early intervention and prevention of later reading difficulties. Evaluative frameworks for determining the utility of a screening process are presented in the literature but have not been applied to many screening measures currently in use in numerous schools across the nation. In this study, the accuracy of several Dynamic Indicators of Basic Early Literacy Skills (DIBELS) subtests in predicting which students were at risk for reading failure in first grade was examined in a sample of 12,055 students in Florida. Findings indicate that the DIBELS Nonsense Word Fluency, Initial Sound Fluency, and Phoneme Segmentation Fluency measures show poor diagnostic utility in predicting end of Grade 1 reading performance. DIBELS Oral Reading Fluency in fall of Grade 1 had higher classification accuracy than the other DIBELS measures, but a comparison with the classification accuracy obtained by assuming that no student had a disability suggests the need to reevaluate the use of classification accuracy as a way to evaluate screening measures without discussion of base rates. Additionally, when cut scores on the screening tools were set to capture 90 percent of all students at risk for reading problems, a high number of false positives were identified. Finally, different cut scores were needed for different subgroups, such as English Language Learners. Implications for research and practice are discussed.
Requests for reprints should be sent to Evelyn Johnson, Boise State University, 1910 University Dr, MS 1725, Boise, ID 83725. Electronic inquiries may be sent to [email protected].

The context for this research is screening within a response to intervention (RTI) framework. RTI is a multitiered instructional and service delivery model designed to improve student learning by providing high-quality instruction, intervening early with students at risk for academic difficulty, allocating instructional resources according to students' needs, and distinguishing between students whose reading difficulties stem from experiential and instructional deficits as opposed to a learning disability. Derived from the prevention sciences, RTI represents an attempt to identify and help struggling readers early, before academic problems develop into intractable deficits.

The starting point for RTI is universal screening (i.e., all students are screened) for academic difficulties. The purpose of screening in RTI is to identify students who, despite a strong general education program (Tier 1), are on a path to failure. In prevention models, screens are designed to predict a negative outcome (the criterion) months or years in advance of the outcome so that appropriate, early intervention can be provided. In this way, reading screens identify students who, without intervention, are destined to fail a future reading test (i.e., to read at a level deemed unsatisfactory). Some schools use norm-referenced test scores for their criterion measure, defining poor reading by a score corresponding to a specific percentile (e.g., below the 10th, 15th, 25th, or 40th percentile). Others define poor reading according to a predetermined standard (e.g., scoring below "basic") on the state's proficiency test. The important point is that satisfactory and unsatisfactory reading outcomes are dichotomous, defined by a cut point on a reading test given later in a student's school career.
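As a concrete illustration of the dichotomous outcome just described, the minimal sketch below labels a student's later reading performance as unsatisfactory when the criterion score falls below a chosen percentile. The 25th-percentile cut point, the scores, and the variable names are hypothetical; a school using a state proficiency standard would substitute its predetermined cut score instead.

```python
import numpy as np

# Hypothetical end-of-year criterion reading scores, one per student.
criterion_scores = np.array([18, 34, 52, 47, 29, 61, 40, 25, 55, 44])

# Define "poor reading" as scoring below the 25th percentile on the criterion
# test; the article notes schools also use the 10th, 15th, or 40th percentile,
# or a predetermined standard such as scoring below "basic". Here the percentile
# is computed from the sample for simplicity; in practice it would come from
# the test's norms.
cut_point = np.percentile(criterion_scores, 25)

# Dichotomous outcome: True = unsatisfactory reading, False = satisfactory.
unsatisfactory = criterion_scores < cut_point

print(f"Cut point: {cut_point:.1f}")
print(f"Unsatisfactory readers: {int(unsatisfactory.sum())} of {criterion_scores.size}")
```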
In identifying candidates for intervention services, schools typically follow one of two models (Jenkins, Hudson, & Johnson, 2007). In Direct Route Models, students identified as at risk by a screening process are immediately provided Tier 2 intervention (e.g., Vellutino et al., 1996). By contrast, in Progress Monitoring (PM) Route Models, universal screening identifies potentially at-risk students whose progress is subsequently monitored for several weeks. These potentially at-risk students may enter Tier 2 intervention depending on the level of their performance and rate of growth on PM measures (Compton, Fuchs, Fuchs, & Bryant, 2006). The PM Route yields better identification accuracy than the Direct Route. For example, Compton et al. (2006) improved overall classification accuracy from 81 to 83 percent and were able to identify 90 percent of the students at risk when they added 5 weeks of word identification fluency progress monitoring to a composite screening measure. Although the PM Route can improve accuracy, it adds to the logistical challenges of screening and can postpone intervention during the PM phase. By contrast, the Direct Route leads to earlier intervention, but without PM to catch screening errors, more students are mistakenly identified as at risk. In a national RTI model site identification project conducted by the National Research Center on Learning Disabilities, most schools reported using a Direct Route Model (Mellard, Byrd, Johnson, Tollefson, & Boesche, 2004).

An ideal screen is practical (i.e., inexpensive, relatively brief, and easily administered, scored, and interpreted) and has consequential validity—meaning that the net effect for students is positive (Messick, 1989). That is, students identified as at risk should receive timely and effective intervention, and no students or groups should be shortchanged. A screen must also be relatively accurate—capable of distinguishing students who will subsequently demonstrate performance difficulties from those who will not (Glover & Albers, 2007; Jenkins, 2003).

A variety of statistics are used to evaluate a screen's accuracy, most prominently sensitivity and specificity. Sensitivity focuses on a screen's accuracy in identifying individuals who fail a later criterion test (e.g., the percentage of the individuals classified as unsuccessful on the future criterion reading test who were correctly identified as at risk on the screen). Sensitivity is calculated as the ratio of the number of students correctly identified by the screen as at risk (true positives) to the total number of students who perform poorly on the outcome measure (true positives + false negatives, or students who pass the screen but fail the outcome measure). By contrast, specificity focuses on a screen's accuracy in identifying individuals who will pass the later criterion measure (e.g., the percentage of the individuals classified as successful on the future criterion reading test who were correctly identified as not at risk on the screen). Specificity is calculated as the ratio of the number of students who are correctly identified as not at risk (true negatives) to the total number of students who are successful on the outcome measure (true negatives + false positives, or students who are identified as at risk by the screen but later are successful on the outcome measure). As sensitivity declines, screens miss more truly at-risk students. As specificity declines, screens overidentify as at risk students who are not really at risk.

Are there minimums for sensitivity and specificity? If the purpose of screening is to ensure that truly at-risk students are identified and helped, sensitivity should be high, perhaps as high as 90 percent, so that few students slip through the cracks (Jenkins, 2003; Jenkins et al., 2007; Jenkins & Johnson, 2008). Specificity should also be relatively high, because overidentifying students as at risk taxes school resources, assigning intervention services to students who do not need them.
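The sensitivity and specificity ratios described above reduce to a short calculation over the four cells of a screen-by-outcome table. The function below is a minimal sketch; the counts passed to it are illustrative and are not drawn from the study's data.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Compute screening sensitivity and specificity from the four cells of a
    2 x 2 table of screen decision (at risk / not at risk) by later reading
    outcome (unsatisfactory / satisfactory)."""
    # Sensitivity: of the students who fail the later criterion test, the
    # proportion the screen correctly flagged as at risk.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: of the students who pass the later criterion test, the
    # proportion the screen correctly labeled as not at risk.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Illustrative counts only (not from the study): 20 true positives,
# 5 false negatives, 150 true negatives, 25 false positives.
sens, spec = sensitivity_specificity(true_pos=20, false_neg=5,
                                     true_neg=150, false_pos=25)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```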
Sensitivity and specificity levels can be difficult to interpret on their own and are of limited practical utility without reporting the associated cut scores. Receiver operating characteristic (ROC) curves offer another useful way to interpret sensitivity and specificity levels and to determine related cut scores. ROC curves are a generalization of the set of potential combinations of sensitivity and specificity possible for a predictor (Pepe, Janes, Longton, Leisenring, & Newcomb, 2004). ROC curve analyses not only provide information about cut scores but also provide a natural common scale for comparing different predictors that are measured in different units, whereas the odds ratio in a logistic regression analysis must be interpreted according to a unit increase in the value of the predictor, which can make comparisons between predictors difficult (Pepe et al., 2004). An overall indication of the diagnostic accuracy of an ROC curve is the area under the curve (AUC). AUC values closer to 1 indicate that the screening measure reliably distinguishes between students with satisfactory and unsatisfactory reading performance, whereas values at .50 indicate that the predictor performs no better than chance (Zhou, Obuchowski, & McClish, 2002).

One other common metric for reporting a screening procedure's utility is classification accuracy, the percentage of students correctly classified as either true positives or true negatives. Classification accuracy, however, must be evaluated against the base rate of the condition being predicted in the population (Meehl & Rosen, 1955; Wilson & Reichmuth, 1985). If the base rate of a condition (e.g., reading disability [RD]) is relatively low, high classification accuracy can be obtained simply by declaring everyone to be not at risk (Wilson & Reichmuth, 1985). For example, if in a population of 100 students the incidence of reading disability is 10 percent, declaring that no student has a reading disability results in 90 percent classification accuracy—the 90 students without a reading disability are correctly identified. However, sensitivity is 0 percent and specificity is 100 percent. If we instead decide that it is important to identify and intervene with students with RD and implement a screening procedure that identifies as at risk 5 students who have RD and 5 students who do not, the overall classification accuracy is still 90 percent, with sensitivity of 50 percent (5/10) and specificity of 94 percent (85/90).

As these examples show, classification accuracy and the implications of decisions made on the basis of screening results can be misleading if they are not considered within the context of base rates (Wilson & Reichmuth, 1985). Ideally, a screening instrument improves classification accuracy at a rate that is significantly better than that obtained by identifying no students and also leads to decisions about intervention that are educationally sound and cost effective (e.g., providing intervention early to students identified as at risk through screening is more efficient than waiting until difficulties become more pronounced and perhaps intractable). In this study, we report sensitivity, specificity, cut scores, classification accuracy, and ROC AUC as indications of the various screening measures' utility.
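To make the base-rate example above concrete, the sketch below reproduces the 100-student scenario from the text: with a 10 percent incidence of RD, declaring everyone not at risk yields 90 percent classification accuracy, and a screen that flags five true positives and five false positives yields the same overall accuracy despite very different sensitivity and specificity. The code is illustrative only.

```python
# Worked example from the text: 100 students, 10 with a reading disability (RD).
n_students = 100
n_rd = 10
n_no_rd = n_students - n_rd

# Scenario 1: declare every student "not at risk".
tp, fn, fp, tn = 0, n_rd, 0, n_no_rd
accuracy_1 = (tp + tn) / n_students          # 0.90
sensitivity_1 = tp / (tp + fn)               # 0.00
specificity_1 = tn / (tn + fp)               # 1.00

# Scenario 2: the screen flags 10 students, 5 with RD and 5 without.
tp, fn, fp, tn = 5, 5, 5, n_no_rd - 5
accuracy_2 = (tp + tn) / n_students          # 0.90 again
sensitivity_2 = tp / (tp + fn)               # 0.50
specificity_2 = tn / (tn + fp)               # about 0.94

print(f"No-screen baseline: accuracy={accuracy_1:.2f}, "
      f"sensitivity={sensitivity_1:.2f}, specificity={specificity_1:.2f}")
print(f"Screen:             accuracy={accuracy_2:.2f}, "
      f"sensitivity={sensitivity_2:.2f}, specificity={specificity_2:.2f}")
```

Identical overall accuracy thus masks the difference between missing every at-risk student and finding half of them, which is why classification accuracy needs to be read against the base rate.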
By far the most widely adopted early screening instrument for identifying at-risk readers is the Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good & Kaminski, 2002), which has been used by more than 14,000 schools (https://dibels.uoregon.edu/data/index.php) to assess more than 1,800,000 students (Samuels, 2007). The DIBELS consist of five brief tests (Letter Name Fluency [LNF], Initial Sound Fluency [ISF], Phoneme Segmentation Fluency [PSF], Nonsense Word Fluency [NWF], and Oral Reading Fluency [ORF]) administered at different points across Grades K–3. The widespread adoption of DIBELS can be attributed in part to its grounding in the letter, phonological, decoding, and text reading skills that are building blocks of beginning reading competence and in part to its ease of use.

However, the DIBELS screens are only moderately accurate in predicting which students will read satisfactorily or unsatisfactorily at the end of Grade 1 (Fuchs, Fuchs, & Compton, 2004; Jenkins et al., 2007; Riedel, 2007; Schatschneider, 2006). For example, Riedel (2007) reported sensitivity and specificity levels for beginning of Grade 1 screens using LNF (68 percent and 65 percent, respectively), PSF (61 percent and 60 percent, respectively), and NWF (68 percent and 68 percent, respectively) in distinguishing between children who ended Grade 1 with poor versus good reading comprehension on a standardized test. Depending on the specific DIBELS measure, the screen missed between 32 percent and 39 percent of the truly at-risk students and mistakenly overidentified between 32 percent and 40 percent of the children. Similarly, Schatschneider (2006) found that the DIBELS LNF screen administered at the end of kindergarten produced sensitivity and specificity levels of 52 percent and 85 percent, respectively, in distinguishing between children who ended Grade 1 with good or poor reading comprehension on a standardized test.

Good (2008) acknowledged that DIBELS screens are more accurate in classifying students who reside at the ends of the reading skill distribution than students who are in the middle and noted that the DIBELS trifold classification of relative risk—"at risk," "some risk," and "low risk"—takes this problem into account. In theory, intervention decisions for the DIBELS "at risk" and "low risk" groups are straightforward (i.e., Tier 2 for the former, Tier 1 for the latter), but intervention decisions for students in the "some risk" group are not. Students who fall in the "some risk" category are as likely to perform satisfactorily as unsatisfactorily on outcome measures (Buck & Torgesen, 2003), making predictions based on this risk category more challenging for practitioners. In practice, even decisions in the "at risk" or "low risk" categories may have questionable accuracy, so schools employing DIBELS in a Direct Route RTI model lack adequate information to improve decision making about all three groups.

Can We Improve on DIBELS Screening Accuracy?

Researchers have reported an advantage for multiple-measure screening batteries (Catts, Fey, Zhang, & Tomblin, 2001; Compton et al., 2006; Foorman, Francis, Fletcher, Schatschneider, & Mehta, 1998; Jenkins & O'Connor, 2002). That is, rather than relying on a single screening measure, schools can sometimes improve classification accuracy by including multiple measures in the screening battery (Davis, Lindo, & Compton, 2007; O'Connor & Jenkins, 1999). For example, in the DIBELS system, screening of beginning first graders typically focuses on NWF. Expanding the screening battery to take into account Letter Sound Fluency (LSF), PSF, and ORF or a vocabulary measure could potentially add to screening accuracy; a sketch of one way such measures might be combined follows below.
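As a hedged sketch only, one way to form such a multiple-measure battery is to combine the subtest scores with logistic regression, evaluate the composite with ROC AUC, and then choose the cut score that captures roughly 90 percent of at-risk readers (the sensitivity floor discussed earlier). The data frame, column names (nwf, lsf, psf, vocabulary, ell), synthetic values, and outcome label below are hypothetical; the study itself does not prescribe this implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical fall-of-Grade-1 screening data: DIBELS-style subtest scores,
# a language-status marker, and a binary outcome (1 = unsatisfactory
# end-of-Grade-1 reading). All values are synthetic and for illustration only.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "nwf": rng.normal(30, 10, n),
    "lsf": rng.normal(25, 8, n),
    "psf": rng.normal(35, 12, n),
    "vocabulary": rng.normal(100, 15, n),
    "ell": rng.integers(0, 2, n),
})
# Synthetic outcome loosely tied to NWF and vocabulary (lower scores = more risk).
risk = 1 / (1 + np.exp(0.08 * (df["nwf"] - 30) + 0.03 * (df["vocabulary"] - 100)))
df["unsatisfactory"] = (rng.uniform(size=n) < risk).astype(int)

X = df[["nwf", "lsf", "psf", "vocabulary", "ell"]]
y = df["unsatisfactory"]

# Compare a single-measure screen (NWF alone) with a multiple-measure composite.
# The negative sign makes lower NWF scores correspond to higher predicted risk.
auc_single = roc_auc_score(y, -df["nwf"])
composite = LogisticRegression(max_iter=1000).fit(X, y)
p_risk = composite.predict_proba(X)[:, 1]
auc_composite = roc_auc_score(y, p_risk)

# Choose the probability cut score that captures at least 90 percent of the
# students who go on to read unsatisfactorily, then read off the specificity
# that cut score allows.
fpr, tpr, thresholds = roc_curve(y, p_risk)
idx = np.argmax(tpr >= 0.90)
print(f"AUC, NWF only: {auc_single:.2f}  AUC, composite: {auc_composite:.2f}")
print(f"Cut score for ~90% sensitivity: {thresholds[idx]:.2f}  "
      f"specificity at that cut: {1 - fpr[idx]:.2f}")
```

In practice the composite and its cut score would be estimated on one sample and checked on another (or cross-validated), and, following the article's point about subgroups, separate cut scores could be examined for groups such as ELLs; the sketch only illustrates the mechanics of comparing a single measure with a battery.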
Riedel (2007) found that vocabulary scores helped identify first graders whose spring DIBELS ORF scores were satisfactory but whose reading comprehension was not. Although this finding was derived from concurrent end-of-year scores (i.e., the vocabulary and criterion tests were administered at the same time), it suggests that including vocabulary in a screening battery might add to classification accuracy when reading comprehension is the outcome of interest.

In addition, information about students' poverty status and language background (e.g., English Language Learners [ELLs]) also holds potential for improving screening accuracy (Jenkins et al., 2007). Two studies have found performance differences associated with language and poverty status on ORF and state standards reading tests (Hixson & McGlinchey, 2004; Wiley & Deno, 2005). Adjusting screening cut scores for these groups may improve classification accuracy.

In this study, we focused on kindergarten and first-grade screens, asking whether the accuracy of DIBELS measures could be improved by combining them with other DIBELS measures, non-DIBELS measures (e.g., vocabulary), and status markers (e.g., language status). We focused on early screens because research indicates that early identification and intervention are generally more successful in reducing the number of students later identified as having a reading disability (Snow, 1998). Following the procedures proposed by Jenkins et al. (2007) and Meehl and Rosen (1955) for screening research protocols, we:

1. Defined "at risk" in terms consistent with operational definitions (e.g.,