Principles of Psychometrics and Measurement Design


2014 Users Conference, San Antonio | March 4th – 7th
Copyright © 1995-2014 Questionmark Corporation and/or Questionmark Computing Limited, known collectively as Questionmark. All rights reserved. Questionmark is a registered trademark of Questionmark Computing Limited. All other trademarks are acknowledged.

Austin Fossey, Reporting and Analytics Manager, Questionmark ([email protected])


Objectives

Learning Objectives:
- Explain the differences between criterion, construct, and content validity
- Summarize a validity study
- Implement Toulmin’s structure to support argument-based validity
- Summarize the concept of reliability and its relationship to validity
- Define the three parts of the conceptual assessment framework


Introduction


Basic Terms

- Measurement – assign scores/values based on a set of rules
- Testing – standardized procedure to collect information
- Assessment – collect information with an evaluative aspect
- Psychometrics – application of a probabilistic model to make an inference about an unobserved/latent construct
- Construct – hypothetical concept that the test developers define based on theory, existing literature, and prior empirical work


“Where do I calculate validity?”

Test developers draw on:
- A deep body of knowledge on best practices
- Well-defined criteria for assessment quality
- Assessment tools & technology


Validity: Uses and Inferences


Validity Survey

- Do you have validity studies built into your test development process?
- Do you use the results to improve your assessments?
- Do you report your findings to stakeholders?
- Do you have a plan in place if there is evidence that an assessment is not valid?


Defining Validity

- Validity refers to proper inferences and uses of assessment results (Bachman, 2005).
  - This implies that the assessment itself is not what is “valid.”
  - Validity refers to how we interpret and use the assessment results.
- “Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (APA, NCME, & AERA, 1999).


Defining Validity

- A simple concept at first glance, but:
  - Validity is a continually evolving concept.
  - There are disagreements about what is important and what needs to be validated (Sireci, 2013).
- It is easy for there to be a lack of alignment between:
  - Validity theories
  - Test development approaches and documentation
  - Informed decisions about the “defensibility and suitability” of results (Sireci, 2013)


“Where do I calculate validity?”

- Modern validity studies are typically research projects with both quantitative and qualitative elements.
- Validity is no longer restricted to test scores.
  - Example: the Smarter Balanced Consortium’s program validity (Sireci, 2013)
- The Standards identify five sources of validity evidence, which are integrated into a validity argument (APA, NCME, & AERA, 1999):
  1. Test content
  2. Response process
  3. Internal structure
  4. Relations to other variables
  5. Testing consequences


Validity Studies: Common Types of Validity


Validity and Reliability

- Reliability is a measure of consistency. It expresses how well our observed scores relate to the true scores (Crocker & Algina, 2008):

  ρ_{XX'} = σ_T² / σ_X²

  where σ_T² is the variance of the true scores and σ_X² is the variance of the observed scores.

- If our instrument is not reliable, our inferences are not valid.
  - We cannot trust the scores.
- But just because an instrument is reliable does not mean our inferences are valid.
  - We still must demonstrate that we measure what we intend to measure and draw the correct inferences from the results.
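The true-score ratio above cannot be computed directly because true scores are unobserved, so reliability is estimated in practice. As an illustration only (not part of the original presentation), here is a minimal NumPy sketch of one common estimate, Cronbach's alpha, computed from a hypothetical item-score matrix:

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Estimate internal-consistency reliability (Cronbach's alpha).

    item_scores: 2-D array with rows = participants, columns = items.
    """
    k = item_scores.shape[1]                          # number of items
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 participants x 4 dichotomously scored items
scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(round(cronbach_alpha(scores), 3))
```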


Criterion-Related Validation

- Demonstrate that assessment scores relate to a relevant criterion that bears on the inferences or uses surrounding the assessment results.
- Concurrent – the relationship between the assessment scores and a criterion measure taken at the same time.
- Predictive – the relationship between the assessment scores and a criterion measure taken in the future.


Criterion-Related Validation Examples

- Concurrent: Do scores on the written driver’s license assessment correlate with performance on the on-the-road test taken the same day?
- Predictive: Do SAT scores correlate with students’ first-semester GPA in college?


Criterion-Related Validation Study

A criterion-related validation study follows these steps (Crocker & Algina, 2008):
1. Identify a suitable criterion behavior and a method for measuring it.
2. Identify a representative sample of participants.
3. Administer the assessment and record the scores.
4. Obtain a criterion measure from each participant in the sample when it becomes available.
5. Determine the strength of the relationship between the assessment scores and the criterion measures (see the sketch below).
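For step 5, the strength of the relationship is commonly summarized with a validity coefficient such as a Pearson correlation. A minimal SciPy sketch with hypothetical numbers (illustrative only, not from the presentation):

```python
import numpy as np
from scipy import stats

# Hypothetical data: assessment scores and criterion measures for the same participants
assessment_scores = np.array([62, 74, 81, 55, 90, 68, 77, 84])
criterion_measures = np.array([3.1, 3.4, 3.8, 2.6, 3.9, 3.0, 3.5, 3.7])

# Validity coefficient: Pearson correlation between scores and criterion
r, p_value = stats.pearsonr(assessment_scores, criterion_measures)
print(f"validity coefficient r = {r:.2f} (p = {p_value:.3f})")
```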


Criterion-Related Validation in Practice

- Criterion problem – the criterion of interest may be a very complex construct (e.g., teaching effectiveness).
  - May require in-depth, ongoing measures of the criterion to validate the assessment results.
- Sample size – small sample sizes will not yield accurate validity coefficients.
  - The study may need to collect research from studies of similar predictors as evidence of criterion-related validity.
- Criterion contamination – assessment scores affect criterion measures (dependence).
- Restriction of range – some measures in the criterion are systematically missing.

From Crocker & Algina, 2008
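When range restriction arises from selecting participants on the assessment itself, a correction is sometimes applied before interpreting the validity coefficient. Below is a minimal sketch of one common adjustment (Thorndike's Case 2 correction for direct range restriction); the function and its numbers are my own illustration, not material from the presentation:

```python
import math

def correct_range_restriction(r_restricted: float,
                              sd_unrestricted: float,
                              sd_restricted: float) -> float:
    """Thorndike Case 2 correction for direct range restriction on the predictor."""
    k = sd_unrestricted / sd_restricted
    return (r_restricted * k) / math.sqrt(1 - r_restricted**2 + (r_restricted * k)**2)

# Hypothetical values: correlation observed among hired applicants only, with
# applicant-pool vs. hired-group standard deviations of the assessment scores
print(round(correct_range_restriction(0.30, sd_unrestricted=12.0, sd_restricted=7.0), 2))
```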


Reporting Criterion-Related Results

- Report statistics describing the relationship between the assessment scores and the criterion measure.
- Report standard errors of measurement and reliability coefficients for the assessment and the criterion (if appropriate).
- Visualize the relationship with an expectancy table (Crocker & Algina, 2008).


Criterion-Related Expectancy Table

Assessment Score Range | % Hired Above Entry Level | % Hired at Entry Level | % Not Hired | Number of Applicants
0 - 20   | ___  | ___  | 100% | 3
20 - 40  | ___  | 20%  | 80%  | 20
40 - 60  | 13%  | 75%  | 13%  | 24
60 - 80  | 30%  | 60%  | 10%  | 10
80 - 100 | 100% | ___  | ___  | 5
Total Applicants | 11 | 28 | 23 | 62

Adapted from Crocker & Algina, 2008
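An expectancy table of this kind is a cross-tabulation of score bands against criterion outcomes, expressed as row percentages. A minimal pandas sketch with hypothetical applicant records (illustrative only, not the data above):

```python
import pandas as pd

# Hypothetical applicant records: assessment score band and hiring outcome
df = pd.DataFrame({
    "score_band": ["0-20", "20-40", "40-60", "40-60", "60-80", "80-100"],
    "outcome": ["Not Hired", "Entry Level", "Entry Level", "Above Entry",
                "Above Entry", "Above Entry"],
})

# Row-normalized crosstab: percent of applicants in each score band with each outcome
expectancy = pd.crosstab(df["score_band"], df["outcome"], normalize="index") * 100
expectancy["N"] = df["score_band"].value_counts()
print(expectancy.round(0))
```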


Content Validity

- Demonstrate that items adequately represent the construct being measured. This requires that the construct be defined with a set of learning objectives or tasks, such as those determined in a Job Task Analysis (JTA) study.
- Content validity studies take place after the assessment is constructed.
- The study should use a set of subject matter experts who are independent from those who wrote the items and constructed the forms.


Content Validation Study

A content validation study follows these steps (Crocker & Algina, 2008):
1. Define the construct or performance domain (e.g., job task analysis, cognitive task analysis).
2. Recruit an independent panel of subject matter experts.
3. Provide the panel with a structured framework and documented instructions for the process of matching items to the construct.
4. Collect, summarize, and report the results.


Content Validation in Practice

- Items can be weighted by importance to determine representation on the assessment (e.g., JTA results). If this is done, “importance” requires a specific definition.
- The process for matching items to objectives needs to be defined in advance.
- Reviewers also need to know which aspects of an item are supposed to be matched to objectives.
- The study may be flawed if the objectives do not properly represent the construct.

From Crocker & Algina, 2008


Reporting Content Validation Results

- Percentage of items matched to objectives (Crocker & Algina, 2008)
- Percentage of items matched to high-importance objectives (Crocker & Algina, 2008)
- Percentage of objectives not assessed by any items (Crocker & Algina, 2008)
- Correlation between objectives’ importance ratings and the number of items matched to those objectives (Klein & Kosecoff, 1975)
- Index of item-objective congruence (Rovinelli & Hambleton, 1977)
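The first three summary statistics are simple percentages computed over the panel's item-objective matches. A minimal sketch with hypothetical matching results (the item and objective names are illustrative only):

```python
# Hypothetical SME results: the objective each item was matched to (None = no match)
item_matches = {"item1": "obj1", "item2": "obj1", "item3": None, "item4": "obj3"}
objective_importance = {"obj1": "high", "obj2": "high", "obj3": "low"}

matched = [obj for obj in item_matches.values() if obj is not None]
pct_items_matched = 100 * len(matched) / len(item_matches)
pct_matched_high = 100 * sum(objective_importance[o] == "high" for o in matched) / len(item_matches)
pct_objectives_unassessed = 100 * sum(o not in matched for o in objective_importance) / len(objective_importance)

print(pct_items_matched, pct_matched_high, pct_objectives_unassessed)
```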


Index of Item-Objective Congruence

- Assumes that each item should measure one and only one objective. Raters score an item +1 if it matches the objective, 0 if there is uncertainty, and -1 if it does not match the objective (Rovinelli & Hambleton, 1977):

  I_ik = (N / (2N - 2)) (μ_k - μ)

  where:
  - I_ik is the index of item-objective congruence for item i on objective k
  - N is the number of objectives
  - μ_k is the mean rating of item i on objective k
  - μ is the mean rating of item i across all objectives
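A minimal NumPy sketch of this index for a single item, using hypothetical rater data (illustrative only, not from the presentation):

```python
import numpy as np

def item_objective_congruence(ratings: np.ndarray, k: int) -> float:
    """Rovinelli-Hambleton index of item-objective congruence for one item.

    ratings: 2-D array of SME ratings (rows = raters, columns = objectives),
             each rating in {-1, 0, +1}.
    k: column index of the objective the item is intended to measure.
    """
    n_objectives = ratings.shape[1]
    mu_k = ratings[:, k].mean()   # mean rating on the intended objective
    mu = ratings.mean()           # mean rating across all objectives
    return (n_objectives / (2 * n_objectives - 2)) * (mu_k - mu)

# Three raters, three objectives; the item is intended to measure objective 0
ratings = np.array([[1, -1, -1],
                    [1,  0, -1],
                    [1, -1, -1]])
print(round(item_objective_congruence(ratings, k=0), 2))   # values near +1 indicate congruence
```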


Construct Validity

- Uses assessment scores and supporting evidence to support a theory of a nomological network:
  - How does a construct relate to observed (measurable) variables?
  - How does a construct relate to other constructs, as represented by other observed variables?

[Figure: sample nomological network in which Construct 1, Construct 2, and Construct 3 are linked to one another and to observed variables A through G]

Slide 25

Copyright © 1995-2014 Questionmark Corporation and/or Questionmark Computing Limited, known collectively as Questionmark. All rights reserved. Questionmark is a registered trademark of Questionmark Computing Limited. All other trademarks are acknowledged.

Construct Validation Study

A construct validation study follows these steps (Crocker & Algina, 2008):
1. Explicitly define a theory of how those who differ on the assessed construct will differ in terms of demographics, performance, or other validated constructs.
2. Administer an assessment whose items are specific, concrete manifestations of the construct.
3. Gather data for other nodes in the nomological network to test the hypothesized relationships.
4. Determine whether the data are consistent with the original theory, and consider other possible conflicting theories (rebuttals).


Construct Validation in Practice

- Possibly one of the more difficult validity studies to complete. Can require a lot of data and research.
- Statistical approaches include multiple regression analysis or factor analysis, but correlations can also be used, as in the multi-trait/multi-method matrix.
- In experimental scenarios, it is difficult to diagnose why relationships are not found:
  - Bad theory? Bad instrument? Both?

From Crocker & Algina, 2008
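As one illustration of the factor-analytic approach, the sketch below simulates six items written to tap two constructs and checks whether a two-factor model recovers the intended groupings. This is my own example with simulated data, not a procedure from the presentation:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulated item scores: 200 participants x 6 items;
# items 0-2 are driven by construct 1, items 3-5 by construct 2
construct_1 = rng.normal(size=(200, 1))
construct_2 = rng.normal(size=(200, 1))
items = np.hstack([
    construct_1 + 0.5 * rng.normal(size=(200, 3)),
    construct_2 + 0.5 * rng.normal(size=(200, 3)),
])

# Fit a two-factor model; the loadings should separate the two item groups
fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(np.round(fa.components_.T, 2))   # rows = items, columns = factors
```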


Reporting Construct Validity Results

- A common method for reporting construct validity is the multi-trait multi-method matrix (Crocker & Algina, 2008).
  - Measuring the same construct with different methods should yield similar results.
  - In practice, the data may come from different studies (not ideal).


Multi-Trait Multi-Method Matrix

Traits: A = Sex-Guilt, B = Hostility-Guilt, C = Morality-Conscience
Methods: True False, Forced Response, Incomplete Sentences

                            True False        Forced Response    Incomplete Sentences
                            A    B    C       A    B    C        A    B    C
1. True False
   A. Sex-Guilt             .95
   B. Hostility-Guilt       .28  .86
   C. Morality-Conscience   .58  .39  .92
2. Forced Response
   A. Sex-Guilt             .86  .32  .57     .95
   B. Hostility-Guilt       .30  .90  .40     .39  .76
   C. Morality-Conscience   .52  .31  .86     .55  .26  .84
3. Incomplete Sentences
   A. Sex-Guilt             .73  .10  .43     .64  .17  .37      .48
   B. Hostility-Guilt       .10  .63  .17     .22  .67  .19      .15  .41
   C. Morality-Conscience   .35  .16  .52     .31  .17  .56      .41  .30  .58

The original slide shades four types of entries: reliability (diagonal), mono-trait/hetero-method, hetero-trait/mono-method, and hetero-trait/hetero-method.

From Mosher, 1968
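An MTMM matrix is simply the correlation matrix of every trait-method combination, arranged so the relevant blocks can be compared. A minimal sketch with simulated data for two traits measured by two methods (illustrative only, not Mosher's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Simulated scores: two traits, each measured by two methods, for 100 participants
latent = rng.normal(size=(100, 2))   # columns: trait 1, trait 2
data = pd.DataFrame({
    "trait1_methodA": latent[:, 0] + 0.4 * rng.normal(size=100),
    "trait1_methodB": latent[:, 0] + 0.4 * rng.normal(size=100),
    "trait2_methodA": latent[:, 1] + 0.4 * rng.normal(size=100),
    "trait2_methodB": latent[:, 1] + 0.4 * rng.normal(size=100),
})

# The MTMM matrix is the correlation matrix of all trait-method combinations;
# mono-trait/hetero-method correlations should be high, hetero-trait ones lower.
print(data.corr().round(2))
```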


Argument-Based Validity

- Criterion, content, and construct validity are crucial aspects of assessment result validity, but how do we demonstrate the link to the inferences and uses of the assessment results?
- Argument-based validity (e.g., Kane, 1992) uses Toulmin’s structure of an argument to support claims about inferences.
- Bachman (2005) expands this to include validity arguments for use cases.


Example Toulmin Structure for a Validity Inference

[Diagram rendered as a list:]
- Data: Mike got a failing score on his exam about making a sandwich.
- Claim (so): Mike cannot make a sandwich.
- Warrant (since): Poor performance on the sandwich exam correlates with poor performance at making sandwiches.
- Backing evidence (supports the warrant): A criterion validity study of sandwich exam scores and sandwich assembly performance at the sandwich shop.
- Rebuttal (unless): Too many questions were about the bread, and Mike did not have sufficient opportunity to demonstrate knowledge of ingredients and layering standards.
- Rebuttal evidence (rejects the rebuttal): A content validity study confirms that items are categorized correctly for the blueprint; the blueprint is based on the results of a JTA; there were not too many questions about bread.
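One way to keep such arguments documented alongside an assessment is to capture each element in a simple data structure. This is a hypothetical sketch of my own, not a feature of Questionmark or a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class ToulminArgument:
    """Minimal record of a Toulmin-style validity argument."""
    data: str
    claim: str
    warrant: str
    backing: list[str] = field(default_factory=list)            # evidence supporting the warrant
    rebuttals: list[str] = field(default_factory=list)          # conditions under which the claim fails
    rebuttal_evidence: list[str] = field(default_factory=list)  # evidence rejecting the rebuttals

sandwich_argument = ToulminArgument(
    data="Mike got a failing score on his exam about making a sandwich",
    claim="Mike cannot make a sandwich",
    warrant="Poor exam performance correlates with poor sandwich-making performance",
    backing=["Criterion validity study of exam scores vs. on-the-job assembly performance"],
    rebuttals=["Too many questions were about the bread"],
    rebuttal_evidence=["Content validity study confirms items match the JTA-based blueprint"],
)
print(sandwich_argument.claim)
```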


Argument-Based Validity for Use Cases

Bachman (2005) defines four decision (use case) warrants that should be addressed with a validity argument for each use case associated with the assessment results:
- Is the interpretation of the score relevant to the decision being made?
- Is the interpretation of the score useful for the decision being made?
- Are the intended consequences of the assessment beneficial for the stakeholders?
- Does the assessment provide sufficient information for making the decision?


Argument-Based Validation Study

An argument-based validation study follows these steps (Chapelle, Enright, & Jamieson, 2010):
1. Identify the inferences, the warrants leading to these inferences, and the assumptions underlying the warrants. Document these inferences.
2. Identify or collect evidence backing the assumptions for the warrants. Document this evidence.
3. Identify or collect rebuttals, and document evidence supporting or refuting each rebuttal. Document this evidence with the evidence backing the assumptions.


Utility of Argument-Based Validity

- “Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use” (APA, NCME, & AERA, 1999).
- Argument-based validity forces us to look at and document the logical connections between classic validity studies and the real-world defensibility of our assessment (Sireci, 2013).


Utility of Argument-Based Validity

By requiring test developers to research inferences and build the argument structure to support these inferences, we can avoid three common fallacies of validity studies:
- Taking inferences and their assumptions as “givens”
- Making overly ambitious, unrealistic inferences
- Making a claim of validity by selectively choosing evidence while glossing over evidence of weaknesses in the inferences

From Kane, 2006


Evidence-Centered Design (ECD): A Principled Test Development Framework


Principled Test Development Frameworks

- Frameworks for connecting assessment tools and practices to reach desired goals for assessment quality:
  - Practical methods for implementing assessment design and development
  - Guide test developers to make thoughtful, explicit decisions
  - Improve the efficiency and effectiveness of item/task development
  - Typically support the documentation of evidence needed to support argument-based validity
- Help manage the growing number of design decisions and granular data needs while minimizing construct-irrelevant variance.

From Ferrara, Nichols, & Lai, 2013


Examples of Principled Test Development Frameworks

- Diagnostic Assessment Framework
- Construct-Centered Framework
- Evidence-Centered Design
- Assessment Engineering
- Principled Design for Efficacy

From Ferrara, Nichols, & Lai, 2013


Principled Test Development Survey

- Do you use a principled test development framework?
- Have you wanted to try to implement a principled test development framework, but been deterred because it seems like too much work?


Evidence-Centered Design

- ECD is a framework for assessment development that is designed to create the evidence needed to support assessment inferences as the assessment is being built (e.g., Mislevy, 2011; Mislevy et al., 2012).
- Applies a broad range of assessment design resources (e.g., subject matter knowledge, software design, psychometrics, pedagogical knowledge) to the inferences.
- Avoids the awkward situation of finding validity problems after the assessment has already been built.


ECD Process

- Domain Analysis: What is important about this domain (construct)? What work and situations are central to this domain?
- Domain Modeling: How do we represent the aspects from the domain analysis as assessment arguments?
- Conceptual Assessment Framework: Design structures: the student model, evidence model, and task model.
- Assessment Implementation: Building the assessment: item writing, scoring engines, statistical models.
- Assessment Delivery: Participants interact with items/tasks. Performance is evaluated, and results and feedback are reported.

From Mislevy et al., 2012


ECD Flexibility

- ECD is designed to be flexible enough to accommodate any assessment design:
  - Different construct modeling approaches
  - New item types and assessment formats with new technology
  - Different scoring models, or combinations of scoring models
  - Growing uses of assessment scores and inferences
- ECD vocabulary and process align test development work across disciplines:
  - Documents how test development outcomes connect
  - A common vocabulary helps people understand what they are doing and why


Example of ECD: IMMEX True Roots

- Educational game to measure cognitive behavior based on the sequence of participants’ actions
- Captures the sequence of responses in the game
- Classifies the sequence with an artificial neural network
- Designed and reported with ECD (Stevens & Casillas, 2006)

[Figure: True Roots problem space (Cox Jr., Jordan, Cooper, & Stevens, 2004)]


Conceptual Assessment Framework (CAF)

- ECD may be a lot to implement for every assessment, but the principles can still help guide our test development work.
- The CAF represents the keystone of ECD. This is how we begin to explain the intellectual leap from scores to inferences.
- Three parts of an assessment’s CAF:
  - Task Model
  - Student Model
  - Evidence Model


Conceptual Assessment Framework (CAF)

[Figure, from Mislevy et al., 2012: the Student Model, Evidence Model, and Task Model connected in sequence; the Evidence Model contains an Evaluation Component and a Measurement Model Component]


CAF: Task Model

- Defines the assumptions and specifications for what the participant can do in the assessment and the features of the environment in which the task takes place (Mislevy et al., 2012).
- Examples of task model decisions:
  - Item format and content
  - Delivery format (random delivery, time limits)
  - Resources
  - Translations or accommodations
  - Response format


CAF: Student Model

- Defines the construct and construct relationships that are being measured and from which we will make an inference (Mislevy et al., 2012).
- Examples of student model decisions:
  - Total score rules and interpretations
  - Topic score rules and interpretations
  - Rubric structure
  - Conditional delivery (e.g., jump blocks, CAT)


CAF: Evidence Model

- Connects the task model and the student model (Mislevy et al., 2012).
- Evaluation Component – defines how evidence is identified from the responses generated within the task model.
  - Rules for identifying correct responses
  - Rules for which aspects of a response to observe in a performance task or human-scored item
  - Sequence and tagging
- Measurement Model – aggregates response data to yield inferences about the student.
  - Item scoring and outcomes
  - Weighting and scaling
  - Aggregation models (e.g., CTT, IRT, Bayes nets, regression, network analysis)
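To make the measurement model concrete, the sketch below contrasts a simple classical test theory aggregation (a weighted total score) with an item response theory view (a 2PL item response function). The numbers and weights are hypothetical, and this is my illustration rather than material from the presentation:

```python
import numpy as np

def prob_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT model: probability of a correct response given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical scored responses (1 = correct) for one participant on five items
responses = np.array([1, 1, 0, 1, 0])

# CTT aggregation: a simple weighted total score
weights = np.array([1, 1, 2, 1, 1])
ctt_score = int(np.dot(responses, weights))

# IRT view: modeled probability of success on an item for a given ability estimate
p = prob_correct_2pl(theta=0.5, a=1.2, b=0.0)
print(ctt_score, round(p, 2))
```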


Reporting the CAF

- There will be blurred lines between the three elements of the CAF, but this is because the three models are interdependent.
- Documenting the CAF is becoming more common in the literature.
- It provides defensibility by demonstrating how your instrument collects and scores evidence about the construct to support specific inferences.
- It naturally lends itself to argument-based validity: this is the evidence needed to support many of your warrants.


Thank you!
Austin Fossey, Reporting and Analytics Manager, Questionmark ([email protected])


References

- American Psychological Association, National Council on Measurement in Education, & American Educational Research Association. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1-34.
- Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010).
- Cox Jr., C. T., Jordan, J., Cooper, M. M., & Stevens, R. (2004). Assessing student understanding with technology: The use of IMMEX problems in the science classroom. Retrieved from http://www.ces.clemson.edu/IMMEX/Charlie/ on February 21, 2014.
- Crocker, L., & Algina, J. (2008). Introduction to classical and modern test theory. Mason, OH: Cengage Learning.


- Ferrara, S., Nichols, P., & Lai, E. (2013). Design and development for next generation tests: Principled design for efficacy (PDE). Proceedings from the Maryland Assessment Research Center Conference. Retrieved from http://marces.org/conference/commoncore/MARCES_SteveFerrara.pdf on February 20, 2014.
- Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
- Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Westport, CT: Praeger Publishers.
- Klein, S. P., & Kosecoff, J. P. (1975). Determining how well a test measures your objectives (CSE Report No. 94). Los Angeles, CA: Center for the Study of Evaluation, University of California.


- Mislevy, R. J. (2011). Evidence-centered design for simulation-based assessment (CRESST Report 800). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
- Mislevy, R. J., Behrens, J. T., DiCerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-centered design, psychometrics, and educational data mining. Journal of Educational Data Mining, 4(1), 11-48.
- Mosher, D. L. (1968). Measurement of guilt by self-report inventories. Journal of Consulting and Clinical Psychology, 32, 690-695.
- Rovinelli, R. J., & Hambleton, R. K. (1977). On the use of content specialists in the assessment of criterion-referenced test item validity. Dutch Journal of Educational Research, 2, 49-60.


- Sireci, S. G. (2013). A theory of action of test validation. Proceedings from the Maryland Assessment Research Center Conference. Retrieved from http://marces.org/conference/commoncore/MARCES_SteveSireci.pdf on February 20, 2014.
- Stevens, R. H., & Casillas, A. (2006). Artificial neural networks. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 259-312). Mahwah, NJ: Erlbaum.
