Instructionally-Embedded Performance Assessments as Measures of Student Performance
Ruth Chung Wei, Stanford University
June 2015
Stanford Center for Assessment, Learning, & Equity
Validity Argument
Instructionally embedded assessments are more closely connected to students’ opportunities to learn (curriculum/instruction) and are, therefore, more valid measures than external assessments.
Instructionally Embedded Assessments vs. External, Standardized Assessments

Instructionally embedded assessments:
• Closely tied to the taught curriculum and units of study
• Teacher-developed or teacher-selected
• Customized to school/classroom context
• Scaffolded with opportunities for feedback and revision
• Provides immediate, targeted information and feedback

External, standardized assessments:
• Loose or assumed connection to taught curriculum (through state standards)
• Externally developed and validated
• Standardized across schools
• On-demand, no opportunities for feedback or revision
• Provides information annually, on broad-level goals
[Diagram relating Instructionally Embedded Assessments (A1–A5), Curriculum & Instruction, State Standards, and External Standardized Assessments]
Technical Challenges
Reliability
• Item-level reliability
• Rater consistency (hand-scoring)
Comparability
• Tasks
• Scores
Research Questions: LDC Performance Tasks
• Can they be scored reliably?
• Are they comparable?
• How do they relate to external measures?
Generalizability Study (Reliability): LDC Rubrics
Estimated Scoring Reliability of the ELA Argumentation Task as a Function of Number of Raters Used to Calculate Scores

Rubric Dimension         1 Rater   2 Raters   3 Raters   4 Raters
Focus                     0.82      0.90       0.93       0.95
Controlling Idea          0.84      0.91       0.94       0.95
Reading/Research          0.87      0.93       0.95       0.96
Development               0.71      0.83       0.88       0.91
Organization              0.71      0.83       0.88       0.91
Conventions               0.77      0.87       0.91       0.93
Content Understanding     0.83      0.91       0.94       0.95
Total Score               0.89      0.94       0.96       0.97
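The multi-rater reliabilities in this table are consistent with applying the standard Spearman-Brown projection to the single-rater values (the presentation does not state how the estimates were computed, so treat this as an illustrative check rather than the study's method). A minimal sketch in Python:

```python
# Spearman-Brown projection: expected reliability when k raters' scores are
# averaged, given the reliability of a single rater (r1).
def spearman_brown(r1: float, k: int) -> float:
    return (k * r1) / (1 + (k - 1) * r1)

# Single-rater reliabilities taken from the table above.
single_rater = {
    "Focus": 0.82,
    "Controlling Idea": 0.84,
    "Reading/Research": 0.87,
    "Development": 0.71,
    "Organization": 0.71,
    "Conventions": 0.77,
    "Content Understanding": 0.83,
    "Total Score": 0.89,
}

for dimension, r1 in single_rater.items():
    projected = [round(spearman_brown(r1, k), 2) for k in (2, 3, 4)]
    print(dimension, projected)  # e.g., Development -> [0.83, 0.88, 0.91]
```

Running this reproduces the 2-, 3-, and 4-rater columns of the table for every dimension.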
Comparability of LDC Tasks?
[Bar chart: Distribution of Holistic Ratings on LDC Task Jurying Rubric (frequency by holistic score): Work In Progress = 21, Good to Go = 12, Exemplary = 2]
Differences in Average LDC Essay Scores by Holistic Task Quality Rating – ELA Tasks
Average LDC Essay Total Scores by Holistic Rating of Task Quality

Task Quality Rating       N      Mean     Std. Dev.   Std. Error
1 = Work In Progress     454    11.831     4.6350       0.2175
2 = Good to Go           256    15.455     5.7812       0.3613
3 = Exemplary             35    14.814     5.2104       0.8807
All                      745    13.217     5.3658       0.1966
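As a quick check on how the last column is derived, the standard error of each group mean is the group's standard deviation divided by the square root of its N. A minimal sketch using the values from the table above:

```python
import math

# (N, mean, standard deviation) for each task-quality group, from the table above.
groups = {
    "Work In Progress": (454, 11.831, 4.6350),
    "Good to Go":       (256, 15.455, 5.7812),
    "Exemplary":        (35, 14.814, 5.2104),
    "All":              (745, 13.217, 5.3658),
}

for label, (n, mean, sd) in groups.items():
    se = sd / math.sqrt(n)  # standard error of the group mean
    print(f"{label}: SE = {se:.4f}")  # e.g., Work In Progress -> 0.2175
```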
Relationships with Other Measures?
• LDC Essay scores are weakly related to the State Literature Exam and Grade 8 Reading Scores (no relationship with Grade 8 Writing Scores).
• LDC Essay scores have a significant but moderate correlation with on-demand Pre- and Post-Assessments.
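These relationships are presumably reported as correlation coefficients computed across students; the sketch below shows how such correlations could be computed, using hypothetical data and column names that are not from the study.

```python
import pandas as pd

# Hypothetical dataset: one row per student, with the LDC essay total score and
# two external measures (values and column names are illustrative only).
df = pd.DataFrame({
    "ldc_essay_total": [12, 18, 9, 21, 15, 14],
    "state_lit_exam":  [1490, 1620, 1430, 1680, 1550, 1530],
    "on_demand_post":  [2.0, 3.5, 1.5, 3.0, 2.5, 2.0],
})

# Pearson correlations of the LDC essay score with each external measure.
print(df.corr(method="pearson")["ldc_essay_total"])
```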
LDC Writing Tasks are more challenging overall
[Bar chart: Distribution of Average LDC Essay Scores - ELA. X-axis: Average Scores Across Dimensions, in bins 1-1.4, 1.5-1.9, 2.0-2.4, 2.5-2.9, 3.0-3.4, 3.5-4.0; Y-axis: Percentage of Students; bar labels of 29.2%, 19.9%, 14.8%, 14.4%, 13.4%, 8.2%]
Distribution of Average LDC Essay Scores (Average of 7 Dimension Scores). Score scale: 1 = Not Yet, 2 = Approaches Expectations, 3 = Meets Expectations, 4 = Advanced.
Distribution of Scores on State Literature Exam
Scatter plot of State Literature Exam Scores and LDC Essay Total Scores
The scatter plot highlights students who score below average on the State Literature Exam (1532) but score relatively higher on the LDC Essay ("Meets Expectations" or "Advanced" scores).
LDC writing tasks are more accessible than on-demand writing assessments
Students with the lowest possible score on an on-demand writing task have a wider range of scores on the LDC writing task.
Summary of Findings
• LDC essays can be scored reliably with sufficient rater training.
• Common scoring criteria support comparability of scoring for instructionally-embedded assessments.
• LDC writing tasks capture some aspects of student proficiency in literacy/writing that differ from what an external standardized test measures.
• LDC writing tasks are more accessible than on-demand writing tasks and provide greater diagnostic information.
Limitations
BUT…
• Teacher-designed assessments are not sufficiently comparable to be used for large-scale, high-stakes purposes.
RECOMMEND…
• Developing common task banks of vetted, piloted, and validated tasks, to which teachers can contribute their instructionally-embedded tasks for review (e.g., LDC's Juried Task Bank, Common Assignments Study, Innovation Lab Network Performance Assessment Pilot task bank).