Principles of Psychometrics and Measurement Design Questionmark Analytics Austin Fossey
2014 Users Conference San Antonio | March 4th – 7th Copyright © 1995-2014 Questionmark Corporation and/or Questionmark Computing Limited, known collectively as Questionmark. All rights reserved. Questionmark is a registered trademark of Questionmark Computing Limited. All other trademarks are acknowledged.
Austin Fossey Reporting and Analytics Manager, Questionmark
[email protected]
Objectives
Learning Objectives:
• Explain the differences between criterion, construct, and content validity
• Summarize a validity study
• Implement Toulmin's structure to support argument-based validity
• Summarize the concept of reliability and its relationship to validity
• Define the three parts of the conceptual assessment framework
Introduction
Basic Terms
• Measurement – assign scores/values based on a set of rules
• Testing – standardized procedure to collect information
• Assessment – collect information with an evaluative aspect
• Psychometrics – application of a probabilistic model to make an inference about an unobserved/latent construct
• Construct – hypothetical concept that the test developers define based on theory, existing literature, and prior empirical work
"Where do I calculate validity?"
Test developers work with:
• A deep body of knowledge on best practices
• Well-defined criteria for assessment quality
• Assessment tools & technology
Validity Uses and Inferences
Validity Survey
• Do you have validity studies built into your test development process?
• Do you use the results to improve your assessments?
• Do you report your findings to stakeholders?
• Do you have a plan in place if there is evidence that an assessment is not valid?
Defining Validity
Validity refers to proper inferences and uses of assessment results (Bachman, 2005).
This implies that the assessment itself is not what is valid or invalid; validity refers to how we interpret and use the assessment results.
“Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (APA, NCME, & AERA, 1999).
Defining Validity
Simple concept at first glance. . .
Validity is a continually evolving concept, and there are disagreements about what is important and what needs to be validated (Sireci, 2013).
It is easy for there to be a lack of alignment between:
• Validity theories
• Test development approaches and documentation
• Informed decisions about the "defensibility and suitability" of results (Sireci, 2013)
“Where do I calculate validity?”
Modern validity studies are typically research projects with both quantitative and qualitative elements. Validity is no longer restricted to test scores.
Smarter Balanced Consortium’s program validity (Sireci, 2013)
The Standards (APA, NCME, & AERA, 1999) identify five sources of evidence to integrate into a validity argument:
1. Test content
2. Response process
3. Internal structure
4. Relations to other variables
5. Testing consequences
Validity Studies Common Types of Validity
Validity and Reliability
Reliability is a measure of consistency: it expresses how well our observed scores relate to the true scores (Crocker & Algina, 2008).

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}$$

where $\sigma_T^2$ is the variance of the true scores and $\sigma_X^2$ is the variance of the observed scores.

If our instrument is not reliable, our inferences are not valid. We cannot trust the scores.
But just because an instrument is reliable does not mean our inferences are valid. We still must demonstrate that we measure what we intend to measure and draw the correct inferences from the results.
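Because true-score variance is not directly observable, reliability is estimated in practice. A minimal sketch (not from the slides, data hypothetical) using coefficient alpha as one common estimate computed from an item-by-participant score matrix:

```python
# A minimal sketch: the true-score variance ratio above cannot be computed
# directly, so reliability is typically *estimated*, e.g., with coefficient
# alpha from a (participants x items) score matrix. Data below is hypothetical.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Estimate reliability from a (participants x items) score matrix."""
    k = item_scores.shape[1]                          # number of items
    item_var = item_scores.var(axis=0, ddof=1)        # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_var.sum() / total_var)

# Hypothetical data: 5 participants answering 4 dichotomous items
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(round(cronbach_alpha(scores), 2))   # 0.79 for this toy data
```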
Criterion-Related Validation
Demonstrate that assessment scores have a relation to a relevant criterion that relates to the inferences or uses surrounding the assessment results.
• Concurrent – relationship between the assessment scores and a criterion measure taken at the same time.
• Predictive – relationship between the assessment scores and a criterion measure taken in the future.
Criterion-Related Validation Examples
Concurrent: do scores on the written driver's license assessment correlate with performance in the on-the-road test taken the same day?
Predictive: do SAT scores correlate with students’ first semester GPA in college?
Criterion-Related Validation Study
Criterion-Related Validation Study (Crocker & Algina, 2008):
1. Identify a suitable criterion behavior and a method for measuring it.
2. Identify a representative sample of participants.
3. Administer the assessment and record the scores.
4. Obtain a criterion measure from each participant in the sample when it becomes available.
5. Determine the strength of the relationship between the assessment scores and the criterion measures.
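A minimal sketch (illustrative only, hypothetical data and variable names) of step 5: estimating the validity coefficient as the correlation between assessment scores and the criterion measure collected later.

```python
# Hypothetical data: assessment scores and a later criterion measure
# (e.g., first-semester GPA) for the same eight participants.
import numpy as np

assessment_scores = np.array([62, 71, 55, 88, 93, 47, 76, 81])
criterion_measure = np.array([2.1, 2.8, 2.0, 3.4, 3.7, 1.8, 3.0, 3.1])

# Pearson correlation between test scores and the criterion
validity_coefficient = np.corrcoef(assessment_scores, criterion_measure)[0, 1]
print(f"Criterion-related validity coefficient: {validity_coefficient:.2f}")
```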
Criterion-Related Validation in Practice
• Criterion problem – the criterion of interest may be a very complex construct (e.g., teaching effectiveness).
• Sample size – small sample sizes will not yield accurate validity coefficients.
• May require in-depth, ongoing measures of the criterion to validate the assessment results.
• Study may need to collect research from studies of similar predictors as evidence of criterion-related validity.
• Criterion contamination – assessment scores affect criterion measures (dependence).
• Restriction of range – some measures of the criterion are systematically missing.
From Crocker & Algina, 2008
Reporting Criterion-Related Results
• Report statistics related to the relation between the assessment scores and the criterion measure.
• Report standard errors of measurement and reliability coefficients for the assessment and the criterion (if appropriate).
• Visualize the relation with an expectancy table (Crocker & Algina, 2008).
Criterion-Related Expectancy Table

Assessment Score Range | % Hired Above Entry Level | % Hired at Entry Level | % Not Hired | Number of Applicants
0–20                   | —                         | —                      | 100%        | 3
20–40                  | —                         | 20%                    | 80%         | 20
40–60                  | 13%                       | 75%                    | 13%         | 24
60–80                  | 30%                       | 60%                    | 10%         | 10
80–100                 | 100%                      | —                      | —           | 5
Total Applicants       | 11                        | 28                     | 23          | 62

Adapted from Crocker & Algina, 2008
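A minimal sketch (hypothetical data) of how a table like this can be produced: bin the assessment scores, cross-tabulate against the criterion outcome, and normalize each row to percentages.

```python
# Hypothetical applicant data: assessment score and hiring outcome.
import pandas as pd

df = pd.DataFrame({
    "score":   [15, 35, 52, 58, 67, 72, 85, 91, 44, 63],
    "outcome": ["Not Hired", "Not Hired", "Entry Level", "Entry Level",
                "Above Entry", "Entry Level", "Above Entry", "Above Entry",
                "Not Hired", "Entry Level"],
})

# Bin scores into ranges, then build the expectancy table as row percentages
df["score_range"] = pd.cut(df["score"], bins=[0, 20, 40, 60, 80, 100])
expectancy = pd.crosstab(df["score_range"], df["outcome"], normalize="index") * 100
print(expectancy.round(0))
```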
Content Validity
Demonstrate that items adequately represent the construct being measured. This requires that the construct be defined with a set of learning objectives or tasks, such as those determined in a Job Task Analysis study. Content validity studies take place after the assessment is constructed. The study should use a set of subject matter experts who are independent from those who wrote the items and constructed the forms.
Content Validation Study
Content Validation Study (Crocker & Algina, 2008):
1. Define the construct or performance domain (e.g., job task analysis, cognitive task analysis)
2. Recruit an independent panel of subject matter experts
3. Provide the panel with a structured framework and documented instructions for the process of matching items to the construct
4. Collect, summarize, and report the results
Content Validation in Practice
• Items can be weighted by importance to determine representation on the assessment (e.g., JTA results). If this is done, it requires a specific definition of "importance."
• The process for matching items to objectives needs to be defined in advance.
• Reviewers also need to know which aspects of an item are supposed to be matched to objectives.
• The study may be flawed if the objectives do not properly represent the construct.
From Crocker & Algina, 2008
Reporting Content Validation Results
• Percentage of items matched to objectives (Crocker & Algina, 2008)
• Percentage of items matched to high-importance objectives (Crocker & Algina, 2008)
• Percentage of objectives not assessed by any items (Crocker & Algina, 2008)
• Correlation between objectives' importance ratings and the number of items matched to those objectives (Klein & Kosecoff, 1975)
• Index of item-objective congruence (Rovinelli & Hambleton, 1977)
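A minimal sketch (hypothetical matching data and importance ratings) of computing the first four reporting statistics from an SME panel's item-to-objective matches:

```python
# Hypothetical SME results: item -> matched objective (None = no match),
# and JTA importance ratings for each objective.
import numpy as np

item_matches = {1: "A", 2: "A", 3: "B", 4: None, 5: "C", 6: "A", 7: "B"}
importance   = {"A": 5, "B": 3, "C": 4, "D": 2}
high_importance = {obj for obj, imp in importance.items() if imp >= 4}

matched = [obj for obj in item_matches.values() if obj is not None]
pct_matched = 100 * len(matched) / len(item_matches)
pct_high    = 100 * sum(obj in high_importance for obj in matched) / len(item_matches)
pct_objectives_unassessed = 100 * sum(obj not in matched for obj in importance) / len(importance)

# Correlation between importance ratings and item counts per objective
counts = [sum(obj == o for o in matched) for obj in importance]
r = np.corrcoef(list(importance.values()), counts)[0, 1]

print(round(pct_matched, 1), round(pct_high, 1),
      round(pct_objectives_unassessed, 1), round(r, 2))
```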
Index of Item-Objective Congruence
Assumes that each item should measure one and only one objective. Raters score an item with +1 if there is a match, 0 if there is uncertainty, and -1 if it does not match the objective (Rovinelli & Hambleton, 1977).

$$I_{ik} = \frac{N}{2N - 2}\left(\mu_k - \mu\right)$$

where $I_{ik}$ is the index of item-objective congruence for item i on objective k, $N$ is the number of objectives, $\mu_k$ is the mean rating of item i on objective k, and $\mu$ is the mean rating of item i across all objectives.
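A minimal sketch (hypothetical ratings) implementing the formula above for a single item: raters score the item against every objective with +1 / 0 / -1, and the index contrasts the target objective's mean rating with the mean rating across all objectives.

```python
import numpy as np

def item_objective_congruence(ratings: np.ndarray, k: int) -> float:
    """ratings: (raters x objectives) matrix of -1/0/+1 ratings for one item;
    k: column index of the objective the item is intended to measure."""
    n_objectives = ratings.shape[1]
    mu_k = ratings[:, k].mean()      # mean rating on the target objective
    mu = ratings.mean()              # mean rating across all objectives
    return (n_objectives / (2 * n_objectives - 2)) * (mu_k - mu)

# Hypothetical panel: three raters, four objectives; the item targets objective 0
ratings = np.array([[ 1, -1, -1,  0],
                    [ 1, -1,  0, -1],
                    [ 1, -1, -1, -1]])
print(round(item_objective_congruence(ratings, k=0), 2))   # 0.89 (max is 1.0)
```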
Construct Validity
Uses assessment scores and supporting evidence to support a theory of a nomological network:
• How does a construct relate to observed (measurable) variables?
• How does a construct relate to other constructs, as represented by other observed variables?
[Figure: Sample nomological network – Constructs 1, 2, and 3 linked to observed variables A through G]
Construct Validation Study
Construct Validation Study (Crocker & Algina, 2008):
1. Explicitly define a theory of how those who differ on the assessed construct will differ in terms of demographics, performance, or other validated constructs.
2. Administer an assessment whose items are specific, concrete manifestations of the construct.
3. Gather data for other nodes in the nomological network to test hypothesized relationships.
4. Determine whether the data are consistent with the original theory, and consider other possible conflicting theories (rebuttals).
Construct Validation in Practice
• Possibly one of the more difficult validity studies to complete. Can require a lot of data and research.
• Statistical approaches include multiple regression analysis or factor analysis, but correlations can also be used, as in the multi-trait/multi-method matrix.
• In experimental scenarios, it is difficult to diagnose why relationships are not found: bad theory? Bad instrument? Both?
From Crocker & Algina, 2008
Reporting Construct Validity Results
A common method for reporting construct validity is with a multi-trait multi-method matrix (Crocker & Algina, 2008).
Measuring the same construct with different methods should yield similar results. In practice, the data may come from different studies (not ideal).
Multi-Trait Multi-Method Matrix

Traits: A = Sex-Guilt, B = Hostility-Guilt, C = Morality-Conscience
Methods: 1 = True-False, 2 = Forced Response, 3 = Incomplete Sentences

                     | 1A   1B   1C  | 2A   2B   2C  | 3A   3B   3C
1. True-False      A | .95
                   B | .28  .86
                   C | .58  .39  .92
2. Forced Resp.    A | .86  .32  .57 | .95
                   B | .30  .90  .40 | .39  .76
                   C | .52  .31  .86 | .55  .26  .84
3. Incomp. Sent.   A | .73  .10  .43 | .64  .17  .37 | .48
                   B | .10  .63  .17 | .22  .67  .19 | .15  .41
                   C | .35  .16  .52 | .31  .17  .56 | .41  .30  .58

Diagonal values are reliabilities; the off-diagonal entries are mono-trait/hetero-method, hetero-trait/mono-method, and hetero-trait/hetero-method correlations.
From Mosher, 1968
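A minimal sketch (simulated, hypothetical data) of how an MTMM correlation matrix can be assembled in practice: score each trait with each method for the same participants, then correlate every trait-method combination.

```python
# Toy model: each trait-method measure = latent trait signal + method noise.
# Column names are illustrative stand-ins for the Mosher traits and methods.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
latent = rng.normal(size=(n, 3))            # stand-in values for the 3 traits

data = {}
for m, method in enumerate(["TrueFalse", "ForcedResp", "IncSent"]):
    for t, trait in enumerate(["SexGuilt", "HostGuilt", "MoralCon"]):
        data[f"{method}_{trait}"] = latent[:, t] + rng.normal(scale=0.5 + 0.2 * m, size=n)

mtmm = pd.DataFrame(data).corr().round(2)   # 9 x 9 trait-method correlation matrix
print(mtmm)
```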
Argument-Based Validity
Criterion, content, and construct validity are crucial aspects of assessment result validity, but how do we demonstrate the link to the inferences and uses of the assessment results? Argument-based validity (e.g., Kane, 1992) provides logic using Toulmin’s structure of an argument to support claims about inferences. Bachman (2005) expands this to include validity arguments for use cases.
Example Toulmin Structure for a Validity Inference

• Claim: Mike cannot make a sandwich.
• Data: Mike got a failing score on his exam about making a sandwich.
• Warrant (since): Poor performance on the sandwich exam correlates with low performance in making sandwiches.
• Backing evidence (supports the warrant): Criterion validity study of sandwich exam scores and sandwich assembly performance at the sandwich shop.
• Rebuttal (unless): Too many questions were about the bread, and Mike did not have sufficient opportunity to demonstrate knowledge of ingredients and layering standards.
• Rebuttal evidence (rejects the rebuttal): Content validity study confirms that items are categorized correctly for the blueprint; the blueprint is based on the results of a JTA; there were not too many questions about bread.
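A minimal sketch (not from the slides; names and structure are hypothetical) of one way to record the elements of a Toulmin-style validity argument so each claim, warrant, and rebuttal stays documented alongside its evidence:

```python
# Hypothetical record structure for documenting a Toulmin-style argument.
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    claim: str
    data: str
    warrant: str
    backing_evidence: list[str] = field(default_factory=list)
    rebuttals: list[str] = field(default_factory=list)
    rebuttal_evidence: list[str] = field(default_factory=list)

sandwich_argument = ToulminArgument(
    claim="Mike cannot make a sandwich",
    data="Mike got a failing score on his exam about making a sandwich",
    warrant="Poor exam performance correlates with low sandwich-making performance",
    backing_evidence=["Criterion validity study of exam scores vs. "
                      "sandwich assembly performance at the shop"],
    rebuttals=["Too many questions were about the bread"],
    rebuttal_evidence=["Content validity study confirms items match the "
                       "blueprint, which is based on JTA results"],
)
print(sandwich_argument.claim)
```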
Argument-Based Validity for Use Cases
Bachman (2005) defines four decision (use case) warrants that should be addressed with a validity argument for each use case associated with the assessment results:
• Is the interpretation of the score relevant to the decision being made?
• Is the interpretation of the score useful for the decision being made?
• Are the intended consequences of the assessment beneficial for the stakeholders?
• Does the assessment provide sufficient information for making the decision?
Argument-Based Validation Study
Argument-Based Validation Study (Chapelle, Enright, & Jamieson, 2010):
1. Identify inferences, the warrants leading to these inferences, and the assumptions underlying the warrants. Document these inferences.
2. Identify or collect evidence backing the assumptions for the warrants. Document this evidence.
3. Identify or collect rebuttals, and document evidence supporting or refuting each rebuttal alongside the evidence backing the assumptions.
Utility of Argument-Based Validity
“Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use” (APA, NCME, & AERA, 1999). Argument-based validity forces us to look at and document the logical connections between classic validity studies and the real world defensibility of our assessment (Sireci, 2013).
Utility of Argument-Based Validity
By requiring test developers to research inferences and build the argument structure to support these inferences, we can avoid three common fallacies of validity studies:
• Taking inferences and their assumptions as "givens"
• Making overly ambitious, unrealistic inferences
• Claiming validity by selectively choosing evidence while glossing over evidence of weaknesses in the inferences
From Kane, 2006
Evidence-Centered Design (ECD) A Principled Test Development Framework
Principled Test Development Frameworks
Frameworks for how to connect assessment tools and practices to reach desired goals for assessment quality:
• Practical methods for implementing assessment design and development
• Guide test developers to make thoughtful, explicit decisions
• Improve the efficiency and effectiveness of item/task development
• Typically support the documentation of evidence needed to support argument-based validity
• Help manage increasing design decisions and granular data needs while minimizing construct-irrelevant variance
From Ferrara, Nichols, & Lai, 2013
Examples of Principled Test Development Frameworks
• Diagnostic Assessment Framework
• Construct-Centered Framework
• Evidence-Centered Design
• Assessment Engineering
• Principled Design for Efficacy
From Ferrara, Nichols, & Lai, 2013
Principled Test Development Survey
• Do you use a principled test development framework?
• Have you wanted to try to implement a principled test development framework, but been deterred because it seems like too much work?
Evidence-Centered Design
ECD is a framework for assessment development that is designed to create the evidence needed to support assessment inferences as the assessment is being built (e.g., Mislevy, 2011; Mislevy et al., 2012).
• Applies a broad range of assessment design resources (e.g., subject matter knowledge, software design, psychometrics, pedagogical knowledge) to the inferences.
• Avoids the awkward situation of finding validity problems after the assessment has already been built.
ECD Process
• Domain Analysis – What is important about this domain (construct)? What work and situations are central to this domain?
• Domain Modeling – How do we represent the aspects from the domain analysis as assessment arguments?
• Conceptual Assessment Framework – Design structures: student model, evidence model, and task model
• Assessment Implementation – Building the assessment: item writing, scoring engines, statistical models
• Assessment Delivery – Participants interact with items/tasks. Performance is evaluated, and results and feedback are reported.
From Mislevy et al., 2012
ECD Flexibility
ECD is designed to be flexible enough to accommodate any assessment design.
• Different construct modeling approaches
• New item types and assessment formats with new technology
• Different scoring models, or combinations of scoring models
• Growing use of assessment scores and inferences
ECD vocabulary and process align test development work across disciplines:
• Documents how test development outcomes connect
• A common vocabulary helps people understand what they are doing and why
Example of ECD: IMMEX True Roots
• Educational game to measure cognitive behavior based on the sequence of participants' actions
• Captures the sequence of responses in the game
• Classifies the sequence with an artificial neural network
• Designed and reported with ECD (Stevens & Casillas, 2006)
[Figure: True Roots problem space (Cox Jr., Jordan, Cooper, & Stevens, 2004)]
Conceptual Assessment Framework (CAF)
ECD may be a lot to implement for every assessment, but the principles can still help guide our test development work. The CAF represents the keystone of ECD: this is how we begin to explain the intellectual leap from scores to inferences.
Three parts of an assessment's CAF:
• Task Model
• Student Model
• Evidence Model
Conceptual Assessment Framework (CAF)
• Student Model
• Evidence Model – Evaluation Component and Measurement Model Component
• Task Model
From Mislevy et al., 2012
CAF: Task Model
Defines the assumptions and specifications for what the participant can do in the assessment and the features of the environment in which the task takes place (Mislevy et al., 2012). Examples of task model decisions:
• Item format and content
• Delivery format (random delivery, time limits)
• Resources
• Translations or accommodations
• Response format
CAF: Student Model
Defines the construct and construct relationships that are being measured and from which we will make an inference (Mislevy et al., 2012). Examples of student model decisions:
• Total score rules and interpretations
• Topic score rules and interpretations
• Rubric structure
• Conditional delivery (e.g., jump blocks, CAT)
CAF: Evidence Model
Connects the task model and the student model (Mislevy et al., 2012).
• Evaluation Component – defines how evidence is identified from the responses generated within the task model:
  – Rules for identifying correct responses
  – Rules for what aspects of a response to observe in a performance task or human-scored item
  – Sequence and tagging
• Measurement Model – aggregates response data to yield inferences about the student:
  – Item scoring and outcomes
  – Weighting and scaling
  – Aggregation models (e.g., CTT, IRT, Bayes nets, regression, network analysis)
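A minimal sketch (hypothetical keys, weights, and rules) of the two evidence-model components: an evaluation component that turns raw responses into item scores, and a measurement-model component that aggregates them (here a simple weighted CTT total; IRT or Bayes nets could replace that step).

```python
# Hypothetical answer key and item weights for a three-item assessment.
answer_key = {"q1": "B", "q2": "D", "q3": "A"}
item_weights = {"q1": 1.0, "q2": 2.0, "q3": 1.0}

def evaluate(responses: dict[str, str]) -> dict[str, int]:
    """Evaluation component: identify evidence (correct/incorrect) per item."""
    return {item: int(resp == answer_key[item]) for item, resp in responses.items()}

def aggregate(item_scores: dict[str, int]) -> float:
    """Measurement model: weighted total score as a proportion of max points."""
    earned = sum(item_weights[i] * s for i, s in item_scores.items())
    return earned / sum(item_weights.values())

participant = {"q1": "B", "q2": "C", "q3": "A"}
print(aggregate(evaluate(participant)))   # 0.5 with these hypothetical weights
```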
Reporting the CAF
• There will be blurred lines between the three elements of the CAF, because the three models are interdependent.
• Documenting the CAF is becoming more common in the literature.
• Provides defensibility: you can demonstrate how your instrument collects and scores evidence about the construct to support specific inferences.
• Naturally lends itself to argument-based validity; this is the evidence needed to support many of your warrants.
Thank you! Austin Fossey Reporting and Analytics Manager, Questionmark
[email protected]
References
American Psychological Association, National Council on Measurement in Education, & American Educational Research Association. (1999). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1-34.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3-13.
Cox Jr., C. T., Jordan, J., Cooper, M. M., & Stevens, R. (2004). Assessing student understanding with technology: The use of IMMEX problems in the science classroom. Retrieved from http://www.ces.clemson.edu/IMMEX/Charlie/ on February 21, 2014.
Crocker, L., & Algina, J. (2008). Introduction to Classical and Modern Test Theory. Mason, OH: Cengage Learning.
References
Ferrara, S., Nichols, P., & Lai, E. (2013). Design and development for next generation tests: Principled design for efficacy (PDE). Proceedings from the Maryland Assessment Research Center Conference. Retrieved from http://marces.org/conference/commoncore/MARCES_SteveFerrara.pdf on February 20, 2014.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 17-64). Westport, CT: Praeger Publishers.
Klein, S. P., & Kosecoff, J. P. (1975). Determining how well a test measures your objectives (CSE Report No. 94). Los Angeles, CA: Center for the Study of Evaluation, University of California.
References
Mislevy, R. J. (2011). Evidence-centered design for simulation-based assessment (CRESST Report 800). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
Mislevy, R. J., Behrens, J. T., DiCerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-centered design, psychometrics, and educational data mining. Journal of Educational Data Mining, 4(1), 11-48.
Mosher, D. L. (1968). Measurement of guilt by self-report inventories. Journal of Consulting and Clinical Psychology, 32, 690-695.
Rovinelli, R. J., & Hambleton, R. K. (1977). On the use of content specialists in the assessment of criterion-referenced test item validity. Dutch Journal of Educational Research, 2, 49-60.
References
Sireci, S. G. (2013). A theory of action of test validation. Proceedings from the Maryland Assessment Research Center Conference. Retrieved from http://marces.org/conference/commoncore/MARCES_SteveSireci.pdf on February 20, 2014.
Stevens, R. H., & Casillas, A. (2006). Artificial neural networks. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated Scoring of Complex Tasks in Computer-Based Testing (pp. 259-312). Mahwah, NJ: Erlbaum.