THE JOURNAL OF THE LEARNING SCIENCES, 00: 1–61, 2012 Copyright © Taylor & Francis Group, LLC ISSN: 1050-8406 print / 1532-7809 online DOI: 10.1080/10508406.2011.652320
Multilevel Assessment for Discourse, Understanding, and Achievement

Daniel T. Hickey
Learning Sciences Program, Indiana University

Steven J. Zuiker
Center for Technology in Teaching & Learning, Rice University
Evaluating the impact of instructional innovations and coordinating instruction, assessment, and testing present complex tensions. Many evaluation and coordination efforts aim to address these tensions by using the coherence provided by modern cognitive science perspectives on domain-specific learning. This paper introduces an alternative framework that uses emerging situative assessment perspectives to align learning across increasingly formal levels of educational practice. This framework emerged from 2 design studies of a 20-hr high school genetics curriculum that used the GenScope computer-based modeling software. The 1st study aligned learning across (a) the contextualized enactment of inquiry-oriented activities in GenScope, (b) “feedback conversations” around informal embedded assessments, and (c) a formal performance assessment; the 2nd study extended this alignment to a conventional achievement test. Design-based refinements ultimately delivered gains of nearly 2 SD on the performance assessment and more than 1 SD in achievement. These compared to gains of 0.25 and 0.50 SD, respectively, in well-matched comparison classrooms. General and specific assessment design principles for aligning instruction, assessment, and testing and for evaluating instructional innovations are presented.
Current educational reforms are increasing pressure to use externally developed achievement tests to evaluate education and validate instructional innovations.
This highlights the disjoint between typical achievement tests and the way in which most learning scientists, cognitive psychologists, and educational researchers think about learning. As articulated in the National Research Council (NRC) report Knowing What Students Know, “The central problem addressed by this report is that the most widely used assessments of academic achievement are based on highly restrictive beliefs about learning and competence not fully in keeping with current knowledge about human cognition and learning” (Pellegrino, Chudowsky, & Glaser, 2001, p. 12). This report helped catalyze longstanding concerns that the specific factual associations in conventional achievement tests undermine efforts by schools and teachers to focus on more enduring conceptual knowledge (e.g., Linn, 1993; Shepard & Dougherty, 1991). Concerns over the consequences of testing for instruction have recently expanded to the consequences for classroom assessments as well (Pellegrino & Goldman, 2008). This makes sense because classroom assessments fall between classroom instruction and external achievement tests. Particular attention is being directed to the gap between assessments and tests (e.g., Wilson, 2004); the need to coordinate instruction, assessments, and tests (e.g., Atkin, Black, & Coffey, 2001; Gitomer & Duschl, 2007; Wiliam, 2007); and the need for assessments and tests to generate broad educational improvement (Frederiksen & Collins, 1989). This evolution in assessment research demonstrates how the dichotomy between “formative” classroom assessments and “summative” external tests has given way to a more nuanced appreciation of a range of educational practices. For many researchers the formative/summative dichotomy has been replaced by a consideration of intended assessment purposes (Wiliam, 2007) and a continuum of formality that ranges from very informal observation to highly formal achievement tests (Shavelson et al., 2008). New appreciation of this continuum and the coordinative potential of classroom assessment are central to the two multistate consortia that were organized in 2010 in the U.S. Department of Education’s $330-million Race to the Top Assessment initiative. This paper considers the challenges of using classroom assessments to (a) coordinate classroom instruction with achievement testing and (b) impact student achievement with instructional innovations. It then introduces a framework for addressing these challenges that was inspired by sociocultural and situative approaches to assessment. The paper describes the framework’s origins, tracing the development of two multiyear studies of an innovative computer-based curriculum for teaching high school genetics. The paper presents two general assessment design principles and five pragmatic principles for coordinating practice and evaluating innovations. The paper concludes with several specific design principles for implementing feedback conversations around informal formative assessments.
THE COORDINATION CHALLENGE

The tensions between instruction, assessment, and testing deserve attention because they are so pervasive. These tensions have been most obvious in reforms driven by large-scale, high-stakes achievement testing. A wave of such reforms in the 1990s embraced modern cognitive science perspectives on learning and more "authentic" assessment formats in statewide testing in the United States and in national tests in other countries. This expansion of portfolio and performance assessments was reined in by concerns over increased cost and decreased reliability compared to conventional tests (Dunbar, Koretz, & Hoover, 1991; Shavelson, Baxter, & Pine, 1992) and perceived lack of instructional improvement (Mehrens, 2002). In the United States these concerns helped pave the way for the No Child Left Behind (NCLB) Act in 2001. NCLB has since generated immense pressure on schools to increase the proportion of students meeting achievement criteria. Although the impact of NCLB on achievement remains unclear, the act certainly narrowed curriculum toward academic content that was in the tested domains and readily tested (Koretz, 2008). NCLB also led to an explosion of commercial computer-based test preparation within mandated tutoring programs at schools that failed to make adequate gains (Mintrop & Sunderman, 2009; Peterson, 2005) and expanded use of interim/benchmark tests to predict success on high-stakes tests (Shepard, 2007). Because such practices further emphasize the narrow characterization of knowledge on typical achievement tests, both practices are controversial and opposed by many.

Meanwhile, similar tensions confound efforts to reform instruction. Learning scientists struggle to document achievement impact. A 2002 NRC report set out a gold standard for educational research emphasizing experimental designs and external tests (Shavelson & Towne, 2002). Against these criteria the What Works Clearinghouse and federal panels continue to review the evidence of impact for a broad range of curricular options. From this perspective there is little evidence that the leading innovations associated with the learning sciences increase achievement over the curriculum they might supplant or replace. These innovations include ThinkerTools in physics (White, 1993), Jasper Woodbury in mathematics (Cognition and Technology Group at Vanderbilt, 1997), Learning by Design in science (Kolodner et al., 2003), and Computer-Supported Intentional Learning Environment (CSILE)/Knowledge Forum for collaborative knowledge building (Scardamalia & Bereiter, 1994). They also include the GenScope computer-based modeling software for genetics (Horwitz & Christie, 2000) that is the focus of the research presented in this paper. It can be argued that this limited evidence of achievement impact contributes to skepticism toward the methods and innovations associated with the learning sciences (e.g., Levin & O'Donnell, 1999; Shavelson, Phillips, Towne, & Feuer, 2003).

In addition to being pervasive, the tensions among instruction, assessment, and testing are complex.
The assumptions behind competing approaches to each are often tacit and overlooked (Case, 1996; Greeno, Collins, & Resnick, 1996). This makes the sources of the tensions between practices incomprehensible to many stakeholders. Along with the lack of a clear endpoint or self-evident measure of success, coordination across even two of these three types of educational practices can present what design scholars call "wicked problems" (Rittel & Webber, 1984).
THE ACHIEVEMENT CHALLENGE

It is challenging to raise scores on external tests using specific curricular innovations. External tests are inevitably removed from any specific curriculum and typically cover large swaths of knowledge (Rothstein, Jacobsen, & Wilder, 2008). This presents a particular challenge for innovations such as GenScope, ThinkerTools, and Jasper Woodbury that delve deeply into the fundamental concepts of a domain while bypassing more specific and readily tested content. The deep focus of such innovations also makes it difficult to define a valid comparison, because targeted knowledge is unlikely to be similarly clustered in a textbook. For many innovators the achievement challenge is compounded by the way in which textbooks (the likely comparison curricula) have been progressively refined over the years to expose students to the specific knowledge likely to appear on achievement tests. This challenge is further heightened by the continued mergers of textbook publishing and educational testing industries and the increased use of "integrated learning systems." In these systems a single vendor provides the curriculum, test preparation materials, and tests (e.g., Pearson's SuccessMaker, Waterford's Early Learning Program). Such systems seem poised to dominate the rapidly expanding K–12 online educational landscape, particularly within for-profit schools.

Current circumstances seem to favor one type of innovation: traditional drill-and-practice test preparation using newly available networked and immersive technology. A notable example is DimensionM, which embeds algebra practice into an immersive three-dimensional videogame. As with Math Blaster and other two-dimensional drill-and-practice games, the relationship between the educational content and the game itself is arbitrary (what Rieber, 1996, labeled exogenous). The fantasy context motivates the practice of low-level skills using selected-response items. Initially marketed with one study showing external achievement impact in a randomized design (Kebritchi, Hirumi, & Bai, 2010), DimensionM has been very successful. Its website highlights school-endorsed tournaments and scholarships; field trials with impressive gains; testimonials from administrators, educators, and players; and promised expansion into other curricular domains.

Drill-and-practice test preparation programs really do "work" to raise achievement scores. They do so by exposing learners to numerous specific associations.
As long as some of these associations can be used to recognize the correct answer or even just rule out one or two incorrect choices on some of the items on the targeted tests, scores will go up (Nolen, Haladyna, & Haas, 1992). Because of the way achievement tests are constructed, answering just a few more otherwise difficult items can increase test scores substantially (Shepard, 2002). Memory research shows how easily humans can memorize information at this recognition-level threshold (Dudai, 1997). Current theories of cognition suggest that such knowledge is unlikely to be recalled when needed in more typical learning and performance contexts; it is even less likely to be applied in real-world situations (Mehrens & Kaminski, 1989). Moreover, the inexpensive and self-paced nature of test prep programs is conducive to randomized trials in school contexts, which allows marketers to readily prove impact.
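To make the arithmetic behind this point concrete, consider a minimal sketch of expected number-correct scores on a multiple-choice test. This is our illustration rather than anything from the studies or tests discussed here; the item counts, proportions, and function name are hypothetical.

```python
# Illustrative only: expected number-correct on a hypothetical 40-item,
# 4-option multiple-choice test for students with different amounts of
# memorized, recognition-level knowledge.
N_ITEMS, N_OPTIONS = 40, 4

def expected_score(known, eliminate_one):
    """`known` items are answered from memorized associations; on
    `eliminate_one` items one distractor can be ruled out before guessing;
    the remaining items are blind guesses."""
    guessed = N_ITEMS - known - eliminate_one
    return (known
            + eliminate_one * (1 / (N_OPTIONS - 1))
            + guessed * (1 / N_OPTIONS))

print(expected_score(known=0, eliminate_one=0))    # 10.0: chance level
print(expected_score(known=6, eliminate_one=10))   # about 15.3
print(expected_score(known=12, eliminate_one=16))  # about 20.3
```

Even the middle case moves a student several raw points above chance without any recallable or transferable understanding, which is the kind of score movement the passage above attributes to recognition-level associations.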
PREVAILING APPROACHES TO COORDINATING INSTRUCTION, ASSESSMENT, AND TESTING

Despite its title, Knowing What Students Know (Pellegrino et al., 2001) reminded readers that educational assessments do not reveal what students really "know." Rather, one must have or create a situation or task in which it is possible to observe student performance. Then one must interpret what students do in that situation. To do so one must articulate assumptions about knowing and the model of learning that follows from those assumptions. These assumptions, along with observations and interpretations of performance, make up the three vertices of the now-familiar "assessment triangle." Knowing What Students Know argued that classroom assessments are typically based on teachers' intuitive or qualitative models of learning, whereas large-scale assessments are almost always based on statistical models of learning; it argued that both should embrace the assumptions of modern cognitive science that focus on broader conceptual representations of domain knowledge and higher order skills. This modern view of knowledge was articulated in the 1999 NRC report titled How People Learn (Bransford, Brown, & Cocking, 1999). In assessment contexts these assumptions are best manifested in the use of research-based learning progressions that model how understanding unfolds in particular domains (e.g., Duncan & Hmelo-Silver, 2009; Duschl, Schweingruber, & Shouse, 2007; Wilson, 2009). Such models underlie a new generation of computer-based assessment methods that use complex statistical techniques to determine where each student falls on a particular progression (e.g., Mislevy & Haertel, 2006; Quellmalz & Pellegrino, 2009). These evidence-centered design methods are formatively useful for guiding educational decision making and remediation (e.g., Behrens, Mislevy, Bauer, Williamson, & Levy, 2004) and for delivering formative feedback directly to learners to advance their progress on the underlying domain knowledge
trajectory (e.g., Shute, Hansen, & Almond, 2008). One of the two Race to the Top Assessment consortia is particularly focused on evidence-centered design tests that will be coordinated with both evidence-centered design and teacher-scored interim assessments. These in turn will be coordinated with an extensive set of formative tools and processes to more directly support teacher decision making and student learning (Smarter Balanced Assessment Consortium, 2010). It is beyond the scope of this paper to detail all of the challenges facing these and other current assessment reform efforts. These efforts are occurring alongside significant changes in technological, political, and economic contexts. It is unclear how assessment reforms will impact the climate for innovation in instruction and classroom assessment more broadly, or even whether high-stakes testing will look any different in the years to come. In addition to the psychometric and logistic challenges that impeded prior waves of assessment reform, the more ambitious elements of the current reforms face new challenges. Pressure to estimate each teacher’s “value added” to each student’s achievement encourages a focus on knowledge that steadily and reliably improves from one year to the next (Baker et al., 2010). Other new challenges come from threats to test security from social networking and ubiquitous mobile devices (e.g., Davis, Drinan, & Gallant, 2009). Multipart evidence-centered design items are expensive and time consuming to create and validate. This means that there will be relatively few available, so each one will be used by more students. Meanwhile, NCLB has left behind not only massive pools of conventional items but also the performance data needed for computer-adaptive tests, which are much easier to secure (and are central to the second Race to the Top Assessment consortium). In summary, it seems possible that the more progressive elements of the current reforms will give way to conventional tests, perhaps with even more pressure to increase scores. It is also possible that the more ambitious assessments will be maintained for the low-stakes assessment, with conventional tests used for high-stakes testing. The concern is that most innovation in instruction and assessment will face continued pressure to obtain and document impact on conventional achievement tests. This paper illustrates a way of framing such tests to make them genuinely useful for innovators who might currently disdain them and for assessment reformers who would rather supplant them. This framing situates these tests within a broader assessment model that can obtain and document achievement impact while delivering other valued outcomes and without compromising innovations by focusing on specific associations that might appear on such tests.
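As a brief gloss on why large pools of conventional items with known difficulties make computer-adaptive testing attractive, the following sketch shows the core select-administer-update loop. It is our simplified illustration under assumed conventions (a Rasch response model, a crude step-size update rule, and invented function names), not a description of any operational testing system.

```python
import math
import random

def p_correct(theta, b):
    """Rasch item response function: P(correct) given ability and difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def adaptive_test(item_bank, true_theta, n_items=10, seed=0):
    """Administer n_items adaptively: always pick the unused item whose
    difficulty is closest to the current ability estimate, then nudge the
    estimate up or down depending on the (simulated) response."""
    rng = random.Random(seed)
    unused = list(item_bank)
    theta_hat, step = 0.0, 1.0
    for _ in range(n_items):
        item = min(unused, key=lambda b: abs(b - theta_hat))
        unused.remove(item)
        correct = rng.random() < p_correct(true_theta, item)
        theta_hat += step if correct else -step
        step *= 0.7  # shrink the adjustment as evidence accumulates
    return theta_hat

bank = [i / 4.0 for i in range(-8, 9)]  # difficulties from -2 to +2 logits
print(adaptive_test(bank, true_theta=0.8))
```

Because each examinee can be served a different calibrated subset of a large pool, such tests are arguably easier to secure than a small number of fixed, multipart tasks, which is consistent with the point made above.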
RESEARCH CONTEXT

The assessment model designed to address the challenges discussed previously emerged across two consecutive design-based research studies using the GenScope software.
This section summarizes the GenScope software and assessments used in both studies. This summary is presented in the context of the first study, which was previously published elsewhere. It then discusses the unresolved issues from the first study that helped set the stage for the second study.

The First GenScope Assessment Project

GenScope was developed with the support of the U.S. National Science Foundation by Paul Horwitz and colleagues at BBN, Inc. The program lets students observe and interact with events across the five major levels of biological organization and features a simplified species (a dragon) and colorful easy-to-use interfaces at each level (see Figure 1).1 Students can manipulate information at one level and immediately see the impact at other levels. Consistent with constructionist views of computer-based learning (Papert, 1993), GenScope lets students discover the relationships that define inheritance. Students can insert a mutation into the DNA of the organism and then follow that mutation through the animated "dance of the chromosomes" during mitosis and meiosis. They can experiment with that mutation by breeding multiple generations of offspring and can even witness the consequences of the mutation in simulations of evolution under various environmental conditions. This reveals the complex relationships that make inheritance one of the most challenging topics for secondary science education (Stewart & Hafner, 1994). When the first project was initiated the software was mostly complete, and curriculum specialists were creating various "guided discovery" activities.

The first GenScope assessment project was initiated when a team at the Center for Performance Assessment at Educational Testing Service led by Ann Kindfield was asked to create an assessment of, and help evaluate, learning in GenScope. This was organized around Kindfield's (1994) foundational studies of learning progressions in inheritance. Rather than working from a curricular sequence or a psychometric model, researchers organized the effort using two underlying dimensions of reasoning about inheritance. The first dimension was the distinction between simpler within-generation reasoning (how genes are expressed as traits) and the more complex between-generation reasoning (how genes and traits are transmitted across generations). The second dimension was the distinction between simpler cause-to-effect reasoning (e.g., the familiar Punnett square) and more complex effect-to-cause reasoning (e.g., using a pedigree chart to figure out how a particular trait was inherited, a common task for geneticists).

1 The GenScope software and most of the curriculum and assessments described in this paper can be downloaded at www.genscope.concord.org. The program and some of the curricular activities used in this study were subsequently incorporated into a more comprehensive program known as BioLogica, which is available at http://biologica.concord.org/.
FIGURE 1 Examples of windows in the GenScope software showing different levels of biological organization (color figure available online).
Crossing these two dimensions resulted in four increasingly difficult types of inheritance problems. Cause-to-effect/within-generation reasoning is straightforward because limited effects are possible from a set of starting causes. These can be directly learned. Although cause-to-effect/between-generation problems (e.g., Punnett squares) are more complex, they can still be solved with little understanding of the underlying processes. In contrast, effect-to-cause/between-generation problems (e.g., interpreting a pedigree chart) have multiple possible causes that could yield a particular effect. Such problems require a robust cognitive model of the relationships between the elements. Although nearly every high school graduate completes biology, few of them learn to solve such problems. In the 1996 National Assessment of
Educational Progress, fewer than 25% of 12th graders selected the correct answer for a multiple-choice pedigree chart problem (National Center for Education Statistics, 1996). The NewWorm performance assessment and initial disappointment. A sophisticated performance assessment called the NewWorm was created using open-ended items and conventional representations of genetics. Validation studies showed that it was reliable and that the resulting scores were valid in terms of content, substantive, generalizability, and structural and external aspects (Hickey, Wolfe, & Kindfield, 2000). The NewWorm was initially administered to several classrooms of high school biology students who had just completed 5–10 guided discovery activities in GenScope across dozens of hours. Although the students were quite adept at breeding particular types of dragons, most were unable to solve many of the cause-to-effect problems; furthermore, none were able to solve the more sophisticated effect-to-cause/between-generation problems (Hickey, Kindfield, Horwitz, & Christie, 1999). The research teams debated whether the NewWorm’s traditional stick-figure representations of chromosomes and the new species and traits were simply too different from the more graphic representations and fanciful dragons in GenScope. In terms of the research on learning transfer, this would mean that our performance assessment required too much “far transfer” of learning from the GenScope context. However, it was possible that whatever principles about inheritance students were constructing were so bound to the GenScope context that they could not use that knowledge in other contexts. To resolve the transfer question a near-transfer assessment was quickly created. Screen captures from the GenScope software were used to assemble paperand-pencil items that used the familiar dragons and traits for the problems on the NewWorm performance assessment. But the same students performed only slightly better on the new assessment. This result convinced the teams that students were not discovering fundamental rules of inheritance in their guided discovery activities in GenScope (Hickey, Kindfield, Horwitz, & Christie, 2003). Dragon Investigation informal assessments. Inspired by Wolf, Bixby, Glenn, and Gardner (1991) and Black and Wiliam (1998), the near-transfer items had obvious potential as formative assessments to support more learning. The items were organized into four Dragon Investigations, one for each type of inheritance problem. As elaborated in the second study, a detailed answer explanation was created that explained the reasoning behind various problems on each. After students completed a set of guided discovery activities they would informally complete the corresponding Dragon Investigation and then discuss their answers. These events were originally structured as teacher-led “assessment conversations” that complemented the work of Duschl and Gitomer (1997).
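To make the contrast between the easier cause-to-effect reasoning and the harder effect-to-cause reasoning described above concrete, here is a minimal sketch for a single autosomal gene with one dominant and one recessive allele. It is our illustration only; the allele symbols, phenotype labels, and function names are hypothetical and are not taken from GenScope, the Dragon Investigations, or the NewWorm.

```python
from itertools import product

# One autosomal gene with a dominant allele "T" and a recessive allele "t".
# A genotype is an unordered pair of alleles.
DOMINANT, RECESSIVE = "T", "t"
GENOTYPES = [("T", "T"), ("T", "t"), ("t", "t")]

def phenotype(genotype):
    """The dominant phenotype appears unless both alleles are recessive."""
    return "dominant" if DOMINANT in genotype else "recessive"

def cross(parent1, parent2):
    """Cause-to-effect/between-generation reasoning (a Punnett square):
    enumerate the equally likely offspring genotypes of two parents."""
    offspring = [tuple(sorted(pair)) for pair in product(parent1, parent2)]
    return {g: offspring.count(g) / len(offspring) for g in set(offspring)}

def possible_parent_pairs(observed_phenotypes):
    """Effect-to-cause/between-generation reasoning: which parental genotype
    pairs could have produced all of the observed offspring phenotypes?"""
    candidates = []
    for p1, p2 in product(GENOTYPES, repeat=2):
        produced = {phenotype(g) for g in cross(p1, p2)}
        if set(observed_phenotypes) <= produced:
            candidates.append((p1, p2))
    return candidates

# Cause to effect: a heterozygote-by-heterozygote cross gives the familiar
# 1:2:1 genotype ratio (a 3:1 phenotype ratio).
print(cross(("T", "t"), ("T", "t")))

# Effect to cause: seeing both phenotypes among offspring rules out some
# parental combinations but still leaves several possibilities.
print(possible_parent_pairs(["dominant", "recessive"]))
```

The cause-to-effect direction yields a single answer distribution, whereas the effect-to-cause direction returns several candidate parental pairs, which is one way to see why such problems demand a model of the underlying mechanism rather than a memorized procedure.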
One early insight was the way students’ collaboration around the GenScope activities provided a rich base of knowledge on which to build. Most of the students found the activities fun and engaging; completing them left behind knowledge of the dragons, their various genes and traits, and the way this information was represented. Even students with little prior experience seemed prepared to engage in relatively sophisticated discussions about the processes of inheritance. Several cycles of refinement revealed a number of new approaches and strategies. In several of the classrooms, near the end of the project, this collaboration evolved into student-directed activities. These were eventually labeled feedback conversations in order to distinguish them from teacher-led assessment conversations. Results and conclusions. With the introduction of the Dragon Investigations the students did much better on the NewWorm. Two counterbalanced NewWorm assessments were created, and students in 17 classrooms completed them before and after 15–20 hours of GenScope activities and Dragon Investigations. Scores were scaled (using a Rasch model), and scale scores showed statistically significant gains in 16 of the classrooms, with the average gain more than a full standard deviation. A follow-up study with two textbook comparison classes obtained gains of more than 3 SD, twice as large as in comparison classes (Hickey et al., 2003). This first project illustrated the potential value of using learning progressions to coordinate learning across classroom activities, informal formative classroom assessments, and a more formal summative assessment. In this sense the project was consistent with current formative assessment research (e.g., Marzano, 2006; Ruiz-Primo & Furtak, 2006; Shavelson et al., 2008). It also revealed how the alignment across these features lent itself nicely to the design-based research methods that many learning scientists embrace. Yet in specific ways the approach that emerged as the first project was concluding and the second project was getting under way diverges from most of the current assessment reforms and research. Sociocultural Perspectives on Assessment As the first project worked to understand the relationship between the GenScope activities, Dragon Investigations, and NewWorm assessment, it drew inspiration from comparative analyses of the three “grand theories” of knowing and learning (i.e., Case, 1996; Greeno et al., 1996). These analyses helped reveal the distinct implications of emerging assessment research that explicitly embraced sociocultural theories of learning (e.g., Gipps, 1999; Moss, 1998; Shepard, 2000). This research revealed that the Vygotskian assumption that new knowledge (i.e., learning) emerges in social interactions before being internalized leads to a broader perspective on learning. This broader perspective draws particular attention to the ways in
which assessment changes the social context in which that assessment occurs. In so doing, sociocultural approaches elevate the importance of consequential aspects of validity, and particularly the broader social consequences (Moss, 1998). These aspects of validity have traditionally been overshadowed by content-related aspects of validity. In contrast to Messick’s (1995) widely cited characterization of consequential validity as something to be examined after an assessment is developed, sociocultural perspectives start with the intended social consequences of a particular assessment practice (Moss, Girard, & Haniford, 2006). The earlier teacher-directed assessment conversations tended to focus most directly on students’ understanding of inheritance. In some cases they amounted to conventional lessons including known-answer questions, taking advantage of the GenScope context and each student’s attempts to solve the Dragon Investigation problems. Although these conversations certainly shaped student understanding of inheritance, it was unclear whether they had any consequences on the ways in which students interacted with each other. The student-directed feedback conversations appeared to be a more promising way of fostering a “community of learners” (Brown & Campione, 1994) around the GenScope software and Dragon Investigations. Black and Wiliam (1998) emphasized that feedback should be useful for formative functions. The broader sociocultural characterization of learning led us to think more broadly about the kinds of learning that our formative feedback practices might support. The shift to student-led conversations was consistent with Gee’s (1996) now-familiar distinction between “little-d” discourse as languagein-use and “big-D” Discourse among participants in a community of practice. The former characterization of classroom discourse emphasizes how the GenScope activities left all of the students with useful knowledge about the dragon genome. This knowledge was very useful for structuring teacher-led conversations. The latter framing of classroom Discourse emphasizes that those activities left each student with knowledge about their classmates’ knowledge. This helped students recognize that the new technical terms they were encountering in the answer explanations (e.g., homozygote) were new to their classmates as well. This helped them use their shared knowledge of the GenScope software to jointly construct meaning (e.g., “the thingy in the meiosis window”). This also helped us appreciate the significance of students finishing each other’s sentences: One student might say “the hetero . . . ummm . . .” and a group mate would say “heterozygote” and complete the sentence, and the first student would promptly use the new term correctly and continue on. The shift to feedback conversations also embraced an appreciation of how teacher authority can undermine student participation in classroom discourse (Gee & Green, 1998). Nonevaluative feedback conversations seemed likely to invite struggling learners to “try out” the discourses and “try on” the identities (Gee, 2001) associated with the practices of inheritance. This helped explain why more
typical classroom assessment practices in some classrooms undermined feedback conversations. The teacher-led conversations seemed rather lackluster, and the student-led conversations sometimes faltered when a teacher would join in and explain whatever the students were struggling to understand. These sociocultural insights about classroom discourse structures and teacher question-asking strategies were not new (e.g., Cazden, 1988). However, these insights had not widely been applied to formative assessment. This perspective suggests that feedback that serves to directly correct individual knowledge can undermine learning when learning is defined in sociocultural terms. The term feedback conversation was used to convey the point that domain-specific conversation is itself feedback that helps communities of learners participate in those very conversations. From this perspective students were giving and getting feedback even if they were not directly clarifying and correcting each other’s answers. Situative Perspectives on Coordinating Instruction, Assessment, and Testing Reflecting on the first project using the comparative analyses in Greeno et al. (1996) and Case (1996) also helped reveal the unique potential of situative theories of cognition for understanding and addressing the tensions that the first study raised. To reiterate, many current assessment reforms argue that the range of assessments and tests should be coordinated around modern cognitive perspectives like the one that had been used to develop the NewWorm assessment. The alternative that emerged was the notion of “aligning” (a) informal assessments that embraced a sociocultural perspective with (b) more formal assessments like the NewWorm. The potential value of doing this was suggested by contemporary situative theories of knowing and learning (especially Greeno & the Middle School Mathematics Through Applications Project Group, 1998). Like sociocultural theories, situative theories assume that knowledge emerges in interactive practice between humans. But situative theories place more emphasis on the role of information technologies in interactions. This highlights the socially situated nature of all interactions with information, because that information is socially defined. This applies even in isolated interaction between individuals and information resources, including assessments. In this way these perspectives focus on socially defined knowledge, treating the knowledge that individuals take away from interactions as “secondary” representations. The key assessment insight is that performance on all assessments is a special case of this situated knowledge. Situativity in prevailing approaches to coordination. Current approaches to coordination do not ignore situative theories. Knowing What Students Know (Pellegrino et al., 2001) acknowledges that this research “emphasizes that
cognitive processes are embedded in social practices,” which suggests that “the performance of students on tests is understood as an activity in the situation presented by the test, and success depends on ability to participate in the practices of test taking” (Pellegrino et al., 2001, p. 187). But the implications of these assumptions are rather narrow: It follows that validation should include the collection of evidence that test takers have the communicative practices that are required for their responses to be actual indicators of such abilities as understanding and reasoning. The assumption that students have the necessary communicative skills has been demonstrated to be false in many occasions. (p. 187, emphasis added)
From a situative perspective "communicative practices" are not something that individuals "have" and that therefore can be assessed in a meaningful way; rather, they are rituals and routines that reside primarily in social and technological contexts. They emerge in interactions among the participants and features that define those contexts (Lave & Wenger, 1991). Thus, they cannot be meaningfully assessed at the individual level. Implicit in this approach to coordination is an "aggregative" reconciliation between individual and social activity in which the principles used to explain interactive social activity are compositions of the principles of individual activity (Greeno et al., 1996, p. 40).2 When used to coordinate formative and summative assessment practices an aggregative approach treats the summative evidence as an index of the aggregated individual learning that was supported by the formative assessment. For many this is desirable because it coherently embraces a single theory of learning. This makes it possible, for example, to examine factors such as implementation fidelity in order to rigorously examine whether teachers implemented embedded formative assessments, as expected, before examining the aggregated learning of their students (e.g., Furtak et al., 2008). But this coherence may undermine interactive participation as articulated previously and raises validity issues as discussed in the next section.

2 This aggregative approach is exemplified in many efforts to reconcile individual and social activity, including Bandura's (2000) cognitive characterization of collective efficacy and Glenn's (1991) behavioral characterization of metacontingencies.

Gitomer and Duschl's (2007) consideration of coordination grants situative perspectives a larger role. They acknowledged that the situative assumption that learning primarily occurs in social interaction is antithetical to individual and standardized assessment:
All assessments are proxies that can only approximate the measure of much broader constructs. Given the set of constraints that exist within our current educational system, we choose to strive for an accommodation of socio-cultural perspectives by attending to certain critical domain practices in our assessment framework, while acknowledging that we are not yet able to attend to all of those social practices. (pp. 291–292)
Thus, Gitomer and Duschl accommodated situative assumptions by expanding the array of relevant knowledge to include interactive social practices without formally assessing them. These practices include public displays of competence, engagement with the tools and practices of domains, and collaborative assessment practices; these are all aspects of the feedback conversations that emerged in the first project. In an important way this is consistent with the model of alignment that had emerged in the first project. No effort was made to formally assess shared activity in the feedback conversation; instead, shared activity was informally interpreted in a continual search for strategies to improve coordination. Explicitly situative approaches to coordination. Rather than just acknowledging or accommodating situative perspectives, it is possible to explicitly embrace situative perspectives in order to align the entire range of formative and summative assessment practices. This approach embraces a more “competitive” relation among perspectives whereby the situative perspective serves as a synthesis of different types of individual activity. When applied to formative and summative assessments, this treats all individual assessment practices as special cases of the interactive social practices that define the particular domain. When all assessments are treated as an element of a broader activity system, all interaction becomes assessment (Greeno & Gresalfi, 2008). This makes it possible to treat the activity at a more formal assessment level as a special case of the activity at the less formal level. Thus, the GenScope activities can be treated as formative assessments relative to the summative function of the Dragon Investigations and feedback conversations. At the same time the Dragon Investigations and feedback conversations could be assigned a formative function relative to the NewWorm assessment. This characterization of the range of instruction, assessment, and testing promises a pragmatic way of reconciling the tensions between those practices. It sidesteps the debate over “authentic” assessment (Wolf et al., 1991), suggesting that neither the NewWorm nor the Dragon Investigations were authentic in terms of the disciplinary practices that define inheritance. The practices embodied in both assessments are rather peculiar in this view but serve specific and potentially useful educational functions (Hickey & Zuiker, 2005). By treating all forms of social change (from fleeting conversations to long-term policy decisions) as learning, situativity argues that every assessment practice has potential formative and summative functions (Hickey & Pellegrino, 2005). As elaborated in the second study, this reframes the search for coherence as the process of balancing the varied formative and summative potential within and across different assessment
levels. Doing so addresses a validity challenge that confounds efforts to coordinate practices and evaluate innovations. Validity Issues in Coordination and Evaluation Although the NewWorm was certainly a “test worth teaching to” (e.g., Yeh, 2001), its close alignment with the Dragon Investigation raised validity issues associated with appropriate preparation for end-of-instruction performance assessments. Mehrens, Popham, and Ryan (1998, pp. 20–21) provided six guidelines for proper preparation for performance assessments. Our feedback conversations were consistent with the last four: Make certain that the student is not surprised, and hence confused, by the performance assessment format; identify evaluative criteria in advance of instructional planning and communicate these to students; stress transferability of the skills and knowledge assessed during the performance task; and foster students’ self-evaluation skills. But what of the first two guidelines? The first guideline was as follows: Determine whether the interpretation to be drawn from students’ performance is related only to the specific task or whether an inference is to be made to a broader domain of performance. Obviously, the NewWorm was intended to make broader inferences about students’ ability, including solving other types of inheritance problems and succeeding in more advanced courses. The second guideline was as follows: When the inference is to the broader domain, one should not instruct in any fashion that would minimize the accuracy of the inference to the broader domain. This raises issues about the similarity between the Dragon Investigations and the NewWorm: This means, for example, that it would be inappropriate to spend more instructional time or effort on the performance of the specific task than on any other potential performance from which one could infer to the broader domain. For some performance tasks, in fact, it may be unethical to spend any time teaching to the specific performance task on the assessment. (Mehrens et al., 1998, p. 20)
On the one hand, the specific traits, representations, and problem formats differed systematically across the two assessments. In addition, the feedback conversations focused on participation in discourse about the underlying principles rather than on memorization of the steps involved in solving the problems. On the other hand, systematically creating the Dragon Investigations items to target problems on the NewWorm inevitably introduced some degree of construct-irrelevant variance (Messick, 1995) and, specifically, construct-irrelevant easiness. Construct-irrelevant variance and validity. Familiarizing students with the specific NewWorm problems meant that some of their improvement was due to specific procedural knowledge that might not transfer to other contexts beyond the NewWorm:
Teachers should not provide students with guided or independent practice on a task that is essentially identical to the task that will constitute the end of instruction performance test. The drawback of providing students with direct practice on tasks that are clones of the task embodied in an end of instruction performance test is that students may learn to master a specific type of task, yet be unable to generalize the skills and knowledge that have been learned to other somewhat dissimilar types of tasks. (Mehrens et al., 1998, p. 20, emphasis added)
The difference between "essentially identical" and "somewhat dissimilar" is subjective. Less obviously, it is remarkably complex. For example, both the Dragon Investigation items and the NewWorm items asked students to provide written explanations for their solutions. But the Dragon Investigation items were worded more informally, and students were never provided with a specific answer to specific problems that could simply be memorized. We might have systematically revised both assessments so that all of the items across the two were clearly dissimilar (i.e., different problems). In retrospect this would have been a daunting task for our project. This also would have introduced unsystematic variance that would have made the NewWorm less useful for informing the refinements to the Dragon Investigations and feedback conversations. The point is that it is difficult in theory and nearly impossible in practice to ascertain the level of construct-irrelevant variance across pairs of problems targeting the same concepts. Consider, for example, the systematic way the Dragon Investigations were created from the NewWorm assessment. This suggested a constant level of construct-irrelevant variance across problem pairs. But differences across pairs cloud this conclusion. With the Punnett square problems it is certainly possible for students to solve a specific problem (e.g., sex-linked inheritance) without understanding the fundamental mechanisms. Because there are finite ways to configure such a problem, the pairs of items might be "clones." And it is certainly possible that some students memorized written explanations without understanding them. Controlling for construct-irrelevant easiness across item pairs was easier with the effect-to-cause problems. With infinite configurations it is easier to create "somewhat dissimilar" items. But this still requires advanced knowledge of domain reasoning and assessment design. Construct-irrelevant easiness across items is further clouded by variation in formative feedback practices. Some students, groups, and classrooms must have focused more directly than others on individual mastery of the various problems in the first study.3 These practices change and evolve over time, particularly as stakes are attached to assessment performance.

3 In ways that are too complex to address in the context of this paper, distinctions between identical and dissimilar items are confounded by assumptions about knowing and learning and by the assessment formats that follow naturally from those assumptions.
Construct-irrelevant easiness fuels the skepticism toward "researcher-developed" summative assessments expressed in the NRC's articulation of scientifically based educational research (Shavelson & Towne, 2002). It also raised questions about the first project's impact on the science subtests of the high school exit exams that many participating students faced. Biology was one of three subscores on the exit exam in the state where most of the data were collected; released tests showed that about a third of the biology items concerned inheritance. Some (but not all) addressed topics in the GenScope curriculum. Although we were confident of impact, it was unclear whether the impact of the GenScope curriculum would be larger than that of conventional textbook curricula. Some of the high-stakes items targeted specific factual knowledge featured in textbooks; others were relatively simple effect-to-cause items that might be learned more efficiently from more structured problem sets in the text. The second project attempted to address these concerns by adding an achievement test as an additional level of assessment.
THE SECOND GENSCOPE ASSESSMENT PROJECT

The second project consisted of three annual cycles of refinement of the GenScope curriculum and the Dragon Investigations. A genetics achievement test was created using externally developed items. Like most achievement test items, these items were selected at random from a much larger pool of items based only on their relative difficulty. Each of the design cycles aimed to maximize the quality and quantity of discourse around the GenScope activities and Dragon Investigations in order to maximize gains on the NewWorm performance assessment. The broader consequences of these refinements were evaluated using this new achievement test.

The study was proposed and initially set up to explore different ways of reconciling behavioral, cognitive, and sociocultural analyses of motivation as applied to different classroom grading practices. During the first year three different approaches to grading Dragon Investigations were explored (criterion-referenced, standards-referenced, and grade-referenced approaches). After the first year it was clear that all three had the same corrosive impact on engagement in the feedback conversations and on student motivation. The teachers stopped grading or even picking up the Dragon Investigations, and the project shifted focus to iteratively enhancing the feedback conversations and enhancing alignment across four levels of outcomes.

Participants

Across the 3 years, four teachers in three schools implemented the GenScope curriculum in 25 ninth-grade life science classes, all in a single major metropolitan area in the southeastern United States.
TABLE 1
Description of Schools, Teachers, and Implementations Across Three Study Years

School | School Demographics | Teacher | Classes Year 1 | Classes Year 2 | Classes Year 3
School 1 | 99% African American; 30% lunch subsidy; 61% passing science | GenScope A | 3 | 2 | 4
School 1 | | GenScope B | 3 | — | —
School 1 | | Comparison C | — | — | 2
School 2 | 40% African American; 18% lunch subsidy; 89% passing science | GenScope D | 2 | 3 | 2
School 3 | 12% African American; 1.5% lunch subsidy; 95% passing science | GenScope E | 5 | — | —
School 3 | | Comparison F | 2 | 4 | 2
The teachers were solicited via a district coordinator and were paid a $600 honorarium for each year they implemented GenScope; these teachers helped recruit two non-GenScope comparison teachers at two of the schools. Two of the GenScope teachers changed careers after the first year of the study. Given that the first year ended up serving as the baseline year, this paper only summarizes the results obtained in those classrooms. Table 1 shows the proportion of students at each school who (a) were African American, (b) qualified for the federal lunch subsidy, and (c) passed the science subtest of the high school exit exam on their first try (all from Year 1). Table 1 also lists the participating teachers and classes each year. School 1 served African American students and had the lowest passing rate (61%). GenScope Teacher A at School 1 implemented GenScope in three or four classes each year, whereas GenScope Teacher B implemented GenScope in three classes during Year 1. During the third year of the study Comparison Teacher C at School 1 invited the research team to collect comparison data in his classrooms, where the conventional text-based curriculum continued to be used to teach genetics. Whereas GenScope Teacher A had an undergraduate biology degree, Comparison Teacher C had a graduate degree in science education and had just begun work on his doctorate. Because most of the refinements across the years focused on GenScope Teacher A's classrooms, and because valid comparison data were available, GenScope Teacher A became the focal teacher in this study. School 2 served a more diverse community. GenScope Teacher D implemented GenScope in two classes each year. Teacher D was a doctoral student in science education and also joined in the GenScope curriculum development effort as a research assistant. School 3 served an affluent community. GenScope Teacher E implemented GenScope in three regular classes and two life science classes at School 3 but left teaching after Year 1. Comparison data were collected in two regular life science classes taught by Comparison Teacher F at School 3.
The three schools were in two large school districts that mandated a well-defined curriculum for ninth-grade life science courses, allocating roughly 1 month to genetics. Both the GenScope and comparison teachers were asked to allot 20 class periods to genetics. Thus, the GenScope curriculum, including the Dragon Investigations and feedback conversations, consumed the same number of instructional hours as the comparison genetics curriculum. All indications suggested that the comparison curriculum was similar to the GenScope curriculum. Both comparison teachers used the same required text, one of the most widely used in the United States. Both comparison teachers described their genetics curriculum as conventional "lecture–homework–quiz–exam" instruction that closely followed the textbook. Because the comparison data in School 1 were collected during Year 3, a major change in student demographics would have presented a potential confound, but the population appeared stable. Within the constraints of classroom-based comparisons the two comparison teachers appear to offer valid comparisons for their corresponding GenScope teachers on pre/post data.

Multilevel Assessment Methods

The four levels of activity in this study were labeled using the categories (but not all of the definitions) of increasingly "distal" outcomes used in a summative evaluation of science reforms by Ruiz-Primo, Shavelson, Hamilton, and Klein (2002). They distinguished between five outcome levels: immediate, close, proximal, distal, and remote (cf. Kennedy, 1999). As shown in Table 2, the GenScope activities, the Dragon Investigations, the NewWorm assessment, and the genetics achievement test were respectively characterized as immediate, close, proximal, and distal outcomes in this study. Remote-level educational outcomes (norm-referenced achievement and beyond) were not considered in this study. Reflecting the situative assumption that all learning activity involves assessment (Greeno & Gresalfi, 2008), we did not make a sharp distinction between "instruction" and "assessment." However, for clarity we refer to both the Dragon Investigations and the NewWorm as assessments. Rather than the common (but misleading) labels formative and summative, the labels informal and formal are used. This helps convey that each level is a point on a continuum that ranges from the informal enactment of the GenScope activities to the highly formal completion of the achievement test.4

4 The names and designations of these levels evolved somewhat across the course of our two studies. Although it is useful to debate the appropriate label or designation of levels, focusing excessively on the points and labels obscures the fundamental argument about the value of aligning activity across three or more points along this continuum.
TABLE 2
Four Levels of Assessment Activity in the Study

Level | Assessment Orientation and Type | Activity in This Study | Timescale | Outcome | Summative Function | Formative Function
Immediate | Event-oriented activities | GenScope guided inquiries | Seconds, minutes | Informal shared discourse | Assess whether activities were enacted correctly | Prepare students to discuss inheritance problems
Close | Activity-oriented investigations | Dragon Investigation formative assessments and feedback conversations | Hours, days | Formal shared discourse | Assess enactment of prior GenScope inquiry activities | Refine collective discourse and individual understanding
Proximal | Curriculum-oriented assessment | NewWorm problem-solving assessment | Days, weeks | Individual understanding | Assess individual understanding of genetics problem solving | Fine-tune the entire curriculum, including close-level assessments
Distal | Standards-oriented test | Genetics achievement test | Months | Aggregated achievement | Document impact on external achievement tests | Inform policymakers about innovation
Expanding the situative insights about multiple assessment functions (rather than intended purposes), we defined the formative and summative potential at each level in light of the other levels. In a very pragmatic way these characterizations organized the cycles of design-based refinement. Each of the four levels was further defined in terms of the approximate timescale (Lemke, 2000) of the learning assessed at that level, which in turn defined the orientation of assessment at each level. Framing the contexts in terms of increasingly lengthy timescales leads to the notion of embedded iterative refinements: The less formal formative functions at one level are embedded in the more formal formative functions at the next.5 This recognition of lengthening timescales across levels helps convey the underlying anthropological notion of prolepsis that framed the alignment of activity across levels. As articulated by Wertsch (1998) and Cole (1995), prolepsis refers to the way in which the prospect of future activity shapes current activity. In this framework the formative functions at each level are proleptically shaped by the summative function of the next level. This alignment of activity involves all of the participants, including the students (by providing an incentive to succeed at each level), teachers (by providing a goal to shape activity at each level), and researchers (by providing project goals and summative evidence of success). This alignment of more fleeting and contextualized learning at one level to the more stable and less contextualized learning at the next level provides the coherence for alignment. It stabilizes the degree of construct-irrelevant variance across two levels while protecting evidential validity at a third level. Practically speaking, this made the evidence from the proximal-level NewWorm trustworthy for refining the curricular activities and informal assessments, while making the distal-level data trustworthy for evaluating the impact of those improvements and estimating high-stakes impact. Following are explanations of the resources and guidelines at each level. They explain how the formative and/or summative functions were conceptualized and refined and include aspects and features that emerged during this project.
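The proleptic pairing described above, in which the formative function at each level is oriented toward the summative function of the next, more formal level, can be sketched as a simple data structure. This is our illustration only; the level names and fields paraphrase Table 2, but the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AssessmentLevel:
    name: str        # e.g., "Immediate"
    activity: str    # what students actually do at this level
    timescale: str   # approximate timescale of the learning assessed
    formative: str   # formative function (looking ahead)
    summative: str   # summative function (looking back)

# Levels and functions paraphrased from Table 2.
levels = [
    AssessmentLevel("Immediate", "GenScope guided inquiries", "seconds to minutes",
                    "prepare students to discuss inheritance problems",
                    "assess whether activities were enacted correctly"),
    AssessmentLevel("Close", "Dragon Investigations and feedback conversations", "hours to days",
                    "refine collective discourse and individual understanding",
                    "assess enactment of prior GenScope inquiry activities"),
    AssessmentLevel("Proximal", "NewWorm problem-solving assessment", "days to weeks",
                    "fine-tune the curriculum, including close-level assessments",
                    "assess individual understanding of genetics problem solving"),
    AssessmentLevel("Distal", "genetics achievement test", "months",
                    "inform policymakers about innovation",
                    "document impact on external achievement tests"),
]

# Prolepsis: each level's formative work is shaped by the summative function
# of the next, more formal level.
for current, more_formal in zip(levels, levels[1:]):
    print(f"{current.name}-level activity is shaped by the {more_formal.name}-level "
          f"summative function: {more_formal.summative}")
```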
Immediate-Level Assessment of GenScope Activities

At this level learning is situated in the unique enactment of the GenScope activities in each class. This learning occurs on a timescale of seconds to minutes as students work together to figure out how to complete the activities and understand the software.

5 The NewWorm and the achievement test were both administered at the same times before and after the curriculum. However, the NewWorm assessed knowledge that develops over a time span of weeks within a particular curriculum. The achievement test captured knowledge that develops more gradually and is assessed at a much longer timescale. More fundamental is that the NewWorm was used to guide refinements to the curriculum, whereas the achievement test was used to evaluate the broader impact of those refinements.
22
HICKEY AND ZUIKER
FIGURE 2 GenScope activity on sex-linked inheritance (condensed for space considerations).
or events. Hence, immediate-level assessment is deemed event oriented. Given the open-ended nature of these activities, the learning that occurs as students complete them is idiosyncratic, fleeting, and contextualized within the enactment. Students often named dragons for friends and family members and often invented terms to refer to features of the environment or phenomena they observed. The 20-hr GenScope curriculum was constructed of guided-inquiry activities using the GenScope software. The activities were 1- to 3-page exercises that structured students’ inquiry into key phenomena and that could each be completed within a single class period. These exercises were open ended and presented challenging questions that students could only answer by systematic investigation using the software. An example from the activity on sex-linked inheritance is in Figure 2. Fifteen activities were organized into four units. Each unit was completed in a single week of 50-min class periods—usually four class periods of GenScope activities followed by a Dragon Investigation and feedback conversation. Assessment at this level is exceedingly informal. It primarily consists of collaborative groups of students assessing whether they were enacting the activity as intended and assessing the consequences of the efforts with the GenScope software. Although the activities had been refined in the first study, the project continued to refine them based on informal observation by the research team. More systematic assessment at this level calls for interpretive methods such as discourse analysis that can account for the highly situated nature of that learning.6 6 A more formal examination of the construction of scientific identities of two students while completing the GenScope activities was completed from video recordings after this study was completed. That analysis did not directly inform this study and is not reported here.
Close-Level Dragon Investigations and Feedback Conversations

Learning at the close level occurred over a timescale of days as students completed the three to four activities in each of the four units. Because it is specific to the unit activities, close-level learning and assessment are deemed activity oriented. The Dragon Investigations and feedback conversations served an informal summative function for the immediate-level activity. That is, the design, enactment, and refinement of the GenScope activities were intended to prepare students to participate more successfully in the Dragon Investigations and feedback conversations. Students were expected to appreciate that deeper shared engagement in the GenScope activities would better prepare them to (a) solve more problems on the Dragon Investigation, (b) make sense of the answer explanations, and (c) participate more successfully in the conversations. More broadly, the research team used the goal of enhancing the quality of these conversations to frame the alignment of the representations and guidelines across these first two levels.

To reiterate, the Dragon Investigations used the organisms and representations from GenScope. The problems were intended to appear familiar and potentially solvable to students who had participated in the corresponding GenScope activities. An item on sex-linked inheritance is shown in Figure 3. But the Dragon Investigations were also intended to direct more attention toward the underlying phenomena of inheritance and away from the organisms, traits, and representations that were specific to GenScope. In the language of situativity theory (Greeno & the Middle School Mathematics Through Applications Project Group, 1998), the Dragon Investigations leveraged GenScope features in order to foreground invariant aspects (e.g., the distinct way all sex-linked traits are inherited) and background variant aspects (i.e., that fire breathing is a sex-linked trait in GenScope).

FIGURE 3 Dragon Investigation formative assessment item on dominance relationships (condensed for space considerations).

Each of the four week-long units concluded with a Dragon Investigation. As elaborated below, in Year 1 the Dragon Investigations were used to assign grades; in Year 2 they were ungraded, but an individually completed cumulative Dragon Investigation was created and used for grading purposes.

The answer explanations for each Dragon Investigation were intended to help students shift from the more vernacular labels and tacit recognition of phenomena during GenScope toward more scientific labels and more explicit articulation of the phenomena of inheritance. The answer explanation for the sex-linked inheritance item is shown in Figure 4. Advanced prose and complex diagrams explained the reasoning behind each item without directly stating the correct answer. Written by a geneticist (Ann Kindfield), these explanations deliberately introduced new content not actually needed to answer the problem; consequently, their readability exceeded the college sophomore level according to Chall's (1995) formula.

FIGURE 4 Answer explanation for dominance relationship items.

The Dragon Investigations and answer explanations were designed specifically to support feedback conversations. This project started from the assumption that teacher-directed discussions were inherently evaluative and therefore more likely to undermine student participation. The teacher's role in the feedback conversations was instead framed as encouraging students to initially "pick up" the tools of the domain and begin to "try out" the identities associated with being a geneticist. Figuring out how to do this while maintaining accountability for individual understanding became the central focus of the refinements in this project.

Assessment at the close level. Compared to immediate-level assessment, close-level assessment is less contextualized. Whereas each enactment of the GenScope activities was expected to be idiosyncratic and unique, the project had specific ways of knowing and interacting in mind for the feedback conversations. That is, the project aimed to foster engaged participation in discourse in which specific scientific formalisms were enlisted initially and then appropriately. The answer explanations served a crucial role in helping students assess their own engagement; although teachers certainly monitored participation and behavior, they were asked to refer students to the answer explanations or to peers rather than interpreting the answer explanations themselves.

Formal analyses of video-recorded feedback conversations were carried out across all 3 years of the study. Originally conceived as a means of illustrating the value of the situative reconciliation of behavioral, cognitive, and sociocultural analyses of the same events, representative samples of videotaped feedback conversations were analyzed multiple times. One analysis was led by Laura Fredrick, an applied behavior analyst who specializes in direct instruction methods in K–12 settings (e.g., Fredrick, Deitz, Bryceland, & Hummel, 2000). A second analysis was led by Ann Kruger, a developmental psychologist who analyzes discourse to understand classroom culture and informal learning environments (e.g., Kruger & Tomasello, 1996). In Years 2 and 3, additional analyses were conducted by graduate research assistants Nancy Schafer and Mariana Michael, respectively, focusing on the level of domain-specific discourse about inheritance in conversational turns.

The intended reconciliation proved problematic for methodological and theoretical reasons beyond the scope of this paper. The ideas about designing feedback conversations that emerged appeared more useful, particularly when embedded in a broader assessment design model. But this presented a challenge that is familiar to design researchers: our assessments of engagement and the refinements that resulted were very closely bound to the specific features of our assessment context. The resulting evidence and the features used to enact the assessment
design principles that emerged may not readily generalize to other instructional contexts. What generalize are the broader ideas about assessing engagement in feedback conversations and the more general principles that emerged in our efforts to apply the resulting insights. Therefore, it seemed less useful to formally prove that our learning outcomes were uniquely the result of particular refinements; it seemed more useful to characterize a set of principles for directly examining discourse across assessment levels in order to iteratively enhance both instruction and assessment. Specifically, we suggest that the considerable resources needed to formally analyze feedback conversations be directed instead to less formal design-based refinements of the features that support such discourse, and that those refinements then be evaluated using an individual assessment that features different problems and representations (e.g., our NewWorm).

Proximal-Level NewWorm Performance Assessment

Learning across the GenScope activities and subsequent feedback conversations together was assessed using the NewWorm performance assessment. This learning takes place across several weeks and is specific to the curriculum; thus, assessment and learning at this level are deemed curriculum oriented. A central thesis of this framework is that the formative functions of the NewWorm are relative and therefore limited to refining the curriculum and the Dragon Investigations. In other words, the NewWorm was not used by the teachers to inform instruction or to generate feedback for students. This helped fix the degree of construct-irrelevant easiness across the close and proximal levels. In turn, it established the trustworthiness of NewWorm scores in a process of iterative refinement, because the scores constituted valid evidence of individual understanding of the concepts targeted by the curriculum.

To reiterate, the NewWorm assessment was organized around the distinction between simpler cause-to-effect problems and harder effect-to-cause problems and the distinction between easier within-generation problems and harder between-generation problems. An effect-to-cause/between-generation item concerning sex-linked inheritance is shown in Figure 5.

FIGURE 5 NewWorm assessment item on dominance relationships (condensed for space considerations).

The items were sequenced to scaffold student performance across increasingly complex problems: The initial items were solvable by most secondary students prior to genetics instruction, whereas the last items were shown to be challenging for biology graduate students and faculty. Reliability and validity had been documented in the first study (Hickey et al., 2000).7 Validity and reliability figures from the present study are reported below. The NewWorm could be administered in one class period and scored efficiently and reliably (i.e.,

7 Rasch analysis showed that 71% of the items had a standardized infit mean square within 2.0 SD. The assessment had a separation index of 5.0, which translates into eight statistically distinct strata of item difficulty. The reliability of the item difficulty indices was .96, and the reliability of the student scores was .87.
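As a reading aid for footnote 7 (our addition, not part of the original analyses), the conventional Rasch relation between a separation index G and its associated reliability R shows that the reported figures are mutually consistent, assuming the standard definitions of these indices:

```latex
% Conventional Rasch relation between separation (G) and reliability (R):
\[
  R \;=\; \frac{G^{2}}{1+G^{2}}
  \qquad\Longleftrightarrow\qquad
  G \;=\; \sqrt{\frac{R}{1-R}} .
\]
% Item side: R = .96 implies G = sqrt(.96/.04) = sqrt(24), roughly 4.9, in line with
% the separation index of about 5.0 reported for item difficulties.
% Person side: R = .87 implies a separation of roughly 2.6 for the student scores.
\[
  G_{\text{items}} = \sqrt{\tfrac{.96}{.04}} \approx 4.9 ,
  \qquad
  G_{\text{persons}} = \sqrt{\tfrac{.87}{.13}} \approx 2.6 .
\]
```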