Instructionally-Embedded Performance Assessments as Measures of Student Performance
Ruth Chung Wei, Stanford University
June 2015
Stanford Center for Assessment, Learning, & Equity
Validity Argument
Instructionally embedded assessments are more closely connected to students’ opportunities to learn (curriculum/instruction) and are, therefore, more valid measures than external assessments.
Instructionally Embedded Assessments vs. External, Standardized Assessments

Instructionally embedded assessments:
• Closely tied to the taught curriculum and units of study
• Teacher-developed or teacher-selected
• Customized to school/classroom context
• Scaffolded with opportunities for feedback and revision
• Provides immediate, targeted information and feedback

External, standardized assessments:
• Loose or assumed connection to taught curriculum (through state standards)
• Externally developed and validated
• Standardized across schools
• On-demand, no opportunities for feedback or revision
• Provides information annually, on broad-level goals
[Diagram relating Instructionally Embedded Assessments (A1–A5), Curriculum & Instruction, State Standards, and External Standardized Assessments]
Technical Challenges
Reliability
• Item-level reliability
• Rater consistency (hand-scoring)
Comparability
• Tasks
• Scores
Research Questions: LDC Performance Tasks
• Can they be scored reliably?
• Are they comparable?
• How do they relate to external measures?
Generalizability Study (Reliability): LDC Rubrics
Estimated Scoring Reliability of the ELA Argumentation Task as a Function of Number of Raters Used to Calculate Scores

Rubric Dimension         1 Rater   2 Raters   3 Raters   4 Raters
Focus                     0.82      0.90       0.93       0.95
Controlling Idea          0.84      0.91       0.94       0.95
Reading/Research          0.87      0.93       0.95       0.96
Development               0.71      0.83       0.88       0.91
Organization              0.71      0.83       0.88       0.91
Conventions               0.77      0.87       0.91       0.93
Content Understanding     0.83      0.91       0.94       0.95
Total Score               0.89      0.94       0.96       0.97
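The multi-rater reliabilities in this table are consistent with applying the standard Spearman-Brown projection to the single-rater values (the presentation does not state how the estimates were computed, so treat this as an illustrative check rather than the study's method). A minimal sketch in Python:

```python
# Spearman-Brown projection: expected reliability when k raters' scores are
# averaged, given the reliability of a single rater (r1).
def spearman_brown(r1: float, k: int) -> float:
    return (k * r1) / (1 + (k - 1) * r1)

# Single-rater reliabilities taken from the table above.
single_rater = {
    "Focus": 0.82,
    "Controlling Idea": 0.84,
    "Reading/Research": 0.87,
    "Development": 0.71,
    "Organization": 0.71,
    "Conventions": 0.77,
    "Content Understanding": 0.83,
    "Total Score": 0.89,
}

for dimension, r1 in single_rater.items():
    projected = [round(spearman_brown(r1, k), 2) for k in (2, 3, 4)]
    print(dimension, projected)  # e.g., Development -> [0.83, 0.88, 0.91]
```

Running this reproduces the 2-, 3-, and 4-rater columns of the table for every dimension.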
Comparability of LDC Tasks?
[Bar chart: Distribution of Holistic Ratings on LDC Task Jurying Rubric (frequency by holistic score): Work In Progress = 21, Good to Go = 12, Exemplary = 2]
Differences in Average LDC Essay Scores by Holistic Task Quality Rating – ELA Tasks
Average LDC Essay Total Scores by Holistic Rating of Task Quality

Task Quality Rating       N      Mean     Std. Dev.   Std. Error
1 = Work In Progress     454    11.831     4.6350       0.2175
2 = Good to Go           256    15.455     5.7812       0.3613
3 = Exemplary             35    14.814     5.2104       0.8807
All                      745    13.217     5.3658       0.1966
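As a quick check on how the last column is derived, the standard error of each group mean is the group's standard deviation divided by the square root of its N. A minimal sketch using the values from the table above:

```python
import math

# (N, mean, standard deviation) for each task-quality group, from the table above.
groups = {
    "Work In Progress": (454, 11.831, 4.6350),
    "Good to Go":       (256, 15.455, 5.7812),
    "Exemplary":        (35, 14.814, 5.2104),
    "All":              (745, 13.217, 5.3658),
}

for label, (n, mean, sd) in groups.items():
    se = sd / math.sqrt(n)  # standard error of the group mean
    print(f"{label}: SE = {se:.4f}")  # e.g., Work In Progress -> 0.2175
```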
Relationships with Other Measures?
• LDC Essay scores are weakly related to the State Literature Exam and Grade 8 Reading Scores (no relationship with Grade 8 Writing Scores).
• LDC Essay scores have a significant but moderate correlation with on-demand Pre- and Post-Assessments.
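These relationships are presumably reported as correlation coefficients computed across students; the sketch below shows how such correlations could be computed, using hypothetical data and column names that are not from the study.

```python
import pandas as pd

# Hypothetical dataset: one row per student, with the LDC essay total score and
# two external measures (values and column names are illustrative only).
df = pd.DataFrame({
    "ldc_essay_total": [12, 18, 9, 21, 15, 14],
    "state_lit_exam":  [1490, 1620, 1430, 1680, 1550, 1530],
    "on_demand_post":  [2.0, 3.5, 1.5, 3.0, 2.5, 2.0],
})

# Pearson correlations of the LDC essay score with each external measure.
print(df.corr(method="pearson")["ldc_essay_total"])
```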
LDC Writing Tasks are more challenging overall
[Bar chart: Distribution of Average LDC Essay Scores - ELA. X-axis: Average Scores Across Dimensions, in bins 1-1.4, 1.5-1.9, 2.0-2.4, 2.5-2.9, 3.0-3.4, 3.5-4.0; Y-axis: Percentage of Students; bar labels of 29.2%, 19.9%, 14.8%, 14.4%, 13.4%, 8.2%]
Distribution of Average LDC Essay Scores (Average of 7 Dimension Scores). Score scale: 1 = Not Yet, 2 = Approaches Expectations, 3 = Meets Expectations, 4 = Advanced.
Distribution of Scores on State Literature Exam
Scatter plot of State Literature Exam Scores and LDC Essay Total Scores
The scatter plot highlights students who score below average on the State Literature Exam (1532) but score relatively higher on the LDC Essay ("Meets Expectations" or "Advanced" scores).
LDC writing tasks are more accessible than on-demand writing assessments
Students with the lowest possible score on an on-demand writing task have a wider range of scores on the LDC writing task.
Summary of Findings
• LDC essays can be scored reliably with sufficient rater training.
• Common scoring criteria support comparability of scoring for instructionally-embedded assessments.
• LDC writing tasks capture some aspects of student proficiency in literacy/writing that differ from what an external standardized test measures.
• LDC writing tasks are more accessible than on-demand writing tasks and provide greater diagnostic information.
Limitations
BUT…
• Teacher-designed assessments are not sufficiently comparable to be used for large-scale, high-stakes purposes.
RECOMMEND…
• Developing common task banks of vetted, piloted, and validated tasks, to which teachers can contribute their instructionally-embedded tasks for review (e.g., LDC's Juried Task Bank, Common Assignments Study, Innovation Lab Network Performance Assessment Pilot task bank).