RUNNING HEAD: Using Item Response Time Data
Using Item Response Time Data in Test Development and Validation: Research with Beginning Computer Users1,2
April L. Zenisky and Peter Baldwin
Center for Educational Assessment, University of Massachusetts Amherst
Correspondence concerning this research should be addressed to April L. Zenisky:
[email protected]
1 Center for Educational Assessment Report No. 593. Amherst, MA: University of Massachusetts, School of Education.
2 Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA, April 8-10, 2006.
Abstract

One added benefit of the choice many testing programs have made to administer assessments via computer is the ability to easily gather item response time data. Research has identified a number of important ways for this information to be incorporated back into test development, administration, and validation. Indeed, with respect to validity, the growing body of literature on the relationship between response time and the item- and person-level factors that affect it has helped test users understand examinee behavior from data-based perspectives not previously feasible, and it illustrates the important role these investigations can play in test development. The present research further explores the use of response time data, particularly with respect to construct-related validity: we investigate the relationship between median response time and item difficulty, item complexity, and cognitive area assessed in the context of mathematics and reading assessments given to adult basic education students, and we also consider differences in response time as they relate to group membership (English as a first or second language). While difficulty, complexity, and cognitive area assessed were shown to affect response time generally, ability differences among examinees also appear to have influenced the magnitude of the differences observed, both overall and with respect to group membership.
Introduction

Background

The increasing use of computers as the mode of test administration has given more test developers access to item response times than was possible in paper-based testing, and recent research has provided a number of avenues for using that information in the test development and validation process. These applications include using response time to inform item selection in the context of computer adaptive testing (e.g., Wang & Hanson, 2005; van der Linden, 2005; van der Linden, in press), identify response aberrancy (van der Linden & van Krimpen-Stoop, 2003; Fitzgerald, Fremer, & Maynes, 2005), model examinee motivation (Wise & Kong, 2005; Wise & DeMars, 2006), manage test-taking strategies and speededness (van der Linden, Scrams, & Schnipke, 1999), and predict testing time and set time limits on tests (Bergstrom, Gershon, & Lunz, 1994; Halkitis, Jones, & Pradhan, 1996).

Much of the research into response time has endeavored to identify item and person factors that affect how long it takes examinees to answer test questions. Using a multiple regression approach, Halkitis, Jones, and Pradhan (1996) examined the relationship between item response times and p-values, point-biserial correlations, and item word counts, and found that as difficulty, discrimination, and word count increased, response time increased. Smith (2000) also used multiple regression to determine the relationship between median response time and five item-level factors (item discrimination, item difficulty, the presence of a figure in an item, word count, and item type). Not unexpectedly, item difficulty exhibited a consistently strong positive relationship with response time across item types, while word count and discrimination varied in significance across item types. No significant relationship between response time and the presence of a figure was found. However, using data from a certification
examination, Bergstrom, Gershon, and Lunz (1994) found that, in general, the presence of a figure did increase response time, as did higher item word counts and item difficulty values. Bridgeman and Cline (2000) examined the relationship between response time and a) item difficulty (items were categorized into five difficulty categories based on their Item Response Theory (IRT) difficulty estimates), b) item type (problem solving or quantitative comparison), c) content area (arithmetic, algebra, geometry, and data interpretation), and d) degree of abstraction (items using only numbers or symbols versus items presented in real-world contexts, which were frequently word problems). Though the study found substantial variation in expected response time across items, items with long expected response times did not appear to disadvantage the examinees receiving them. Research by Parshall, Mittelholtz, and Miller (1994) considered presentation order, content classification, and cognitive classification: though cognitive classification was found to have no relationship to latency, presentation order and content classification did. Yang, O'Neill, and Kramer (2002) and Masters, Schnipke, and Connor (2005) both examined the relationship between item difficulty and response time, but the latter researchers also investigated the impact on response times of items requiring calculations and items requiring the use of external supplemental information. Their results indicated that controlling for difficulty alone in item selection would not ensure consistency in total testing time across forms, as variations within and across item types relate significantly to item response times. An interesting extension of this line of response time analysis was carried out by Schnipke and Pashley (1997), who investigated response time as it related to differences in test performance for different subgroups. They found that significant differences in response times existed
between examinees for whom English was and was not a first language, and that a speededness effect might be particularly pronounced for non-native English speakers.
Rationale and Purpose

Across different testing contexts, response time data may provide construct validity evidence or illuminate possible sources of construct-irrelevant variance. A considerable effort is made to standardize testing conditions with the goal of minimizing construct-irrelevant variance or, at the very least, equalizing it for all examinees. However, when speededness is not intentionally part of the construct being measured, differential speededness due to subgroup membership or test composition (in the case of adaptive testing) has the potential to undermine the goals of standardization. Consideration of differences in response latency can inform test developers and test users about the test-taking experience more broadly and supply evidence either supporting or refuting some of the assumptions underlying standardization. For example, by examining response time on a computer-based test, Huff and Sireci (2001) were able to identify construct-irrelevant variance due to computer familiarity and computer proficiency. In this case, the test delivery mode, while standardized, had a differential effect on examinee performance (clearly an area of concern for test developers). Where significant differences in subgroup performance exist, the relationship between item factors, response time, and subgroup performance can help explain these differences. Even when performance differences are minimal, differences in response time for different items may be diagnostically useful.

The current study was prompted by the presence of differences in overall response latency for items written to different cognitive areas. These discrepancies were detected in the course of routine analyses of response time data for a computer-administered field test. An
investigation of these differences is Study 1. Response time differences for examinee subgroups (based on self-reports of language status: English as a first language or not) were also discovered. These findings were consistent with research by Abedi, Lord, and Plummer (1997), in which NAEP items were grouped into long and short items and English-language learners were found to perform significantly lower on longer test items as well as on items judged to be more linguistically complex. The appearance of these differences prompted Study 2, an inquiry into item-level sources of such differences.

Ultimately, the research presented here should be viewed as part of the growing body of research on factors affecting response time (including Smith, 2000; Halkitis, Jones, & Pradhan, 1996; Bergstrom, Gershon, & Lunz, 1994; Yang, O'Neill, & Kramer, 2002; and Masters, Schnipke, & Connor, 2005), with the goal of providing further insight into the relationship between test items, examinee response time, and examinee performance, particularly the performance of subgroups. In addition, it is hoped that this research will offer strategies for the kinds of analyses that testing programs administering assessments via computer might carry out using response time data.

Data Source

Data for both studies come from pilot administrations of computerized multiple-choice tests in Math and Reading for adult basic education (ABE) students in one Northeastern state, carried out between October 2005 and January 2006. Students in this population span a wide range of academic proficiency, from pre-literacy/pre-numeracy to GED level, and an equally wide range of computer literacy and familiarity with the task of taking a computerized test. In total, 3,284 students completed a 40-item Mathematics pilot test and 3,254 students completed a 40-item Reading pilot test. In Mathematics, there were four
broad levels of difficulty (Low, Medium, High, and Advanced), while in Reading there were three levels of difficulty (Low, Medium, and High). Students were assigned to levels for this assessment by their teachers. The count of examinees per level in each content area is shown in Figure 1.

Figure 1. Count of examinees per content area/level
The use of multiple forms within each level allowed data to be collected on a total of 349 Mathematics items and 306 Reading items. All test questions in this pilot were multiple-choice. The test was delivered via the Internet at adult learning centers in this Northeastern state. Each student was provided with a username and password to access the test and was given two hours to complete the test in one content area. This time limit was purely an administrative convenience: all examinees finished the Reading test within about one hour, and the longest any examinee spent on the Math test was one hour and ten minutes. A screen capture of the test window is provided in Figure 2.
Figure 2. Screen Capture of Test Window.
Item Factors

The literature connecting item response times and item-level factors has generally focused on item difficulty, discrimination, word count, item type, and the presence of a figure (e.g., Smith, 2000; Halkitis, Jones, & Pradhan, 1996; Bergstrom, Gershon, & Lunz, 1994; Yang, O'Neill, & Kramer, 2002; Masters, Schnipke, & Connor, 2005). In the present study the focus is on item difficulty, the cognitive area assessed, and item complexity level. How each of these factors is operationalized for the purpose of these studies is defined below.
• Item difficulty: Each level of the test within content areas was calibrated using the one-parameter logistic IRT model, and the obtained b-values were then linked to be on the same scale within content areas. As was done in the Bridgeman and Cline (2000) study, the item difficulty values were then recoded into difficulty categories to establish the broad nature of the relationship between response time and item difficulty in relation to the other independent variables. The counts of items in the five difficulty categories are given in Table 1 below.

Table 1. Counts of items in difficulty categories by content area/test level

Content Area/      Very Easy   Easy             Medium         Difficult      Very Difficult
Test Level         b ≤ -1.5    -1.5 < b < -.5   -.5 < b < .5   .5 < b < 1.5   b ≥ 1.5
Math, Low          19          49               54             7              --
Math, Medium       3           22               64             22             2
Math, High         1           11               47             57             4
Math, Advanced     --          2                19             38             18
Reading, Low       2           35               73             17             --
Reading, Medium    --          20               71             22             --
Reading, High      --          15               75             32             4
• Cognitive dimension assessed: All Math items were written to correspond to one of three cognitive dimensions as specified in the test specifications. These dimensions, which are in essence groupings of the cognitive domains in Bloom's taxonomy, are 1) Knowledge and Comprehension, 2) Application, and 3) Analysis, Synthesis, and Evaluation. For Reading, each item was associated with one of four reading purposes: 1) for Word Skills, 2) to Gain Information, 3) for Literature, or 4) to Perform a Task.
• Item complexity level: The development of this dimension was informed by the research of Bridgeman and Cline (2000) on "real" and "pure" items, and other research cited above has likewise found the presence (or absence) of one item component or another (e.g., passages, figures, tables) to affect response time. For the set of 349 Math items examined here, some items had figures or tables in the stems, some had figures or tables in the answer choices, some required the use of a calculator, and some had none of these elements. For the present research, if a Math item had one or more of these elements it was considered a 'complex' item, while an item without any such elements was comparatively 'simple' in that it was phrased as a question or incomplete stem without reference to any external element. Among the Reading items, the inclusion of a passage or graphic in the stem likewise characterized complex items, while simple items were again those questions or incomplete stems without reference to external elements.

Table 2 summarizes the number of items by test level, cognitive area/reading purpose assessed, and complexity level for Math, and Table 3 does the same for Reading. A sketch illustrating how the difficulty and complexity codings might be applied follows Table 3.
Table 2. Counts of Math items by test level, cognitive dimension, and item complexity

Level       Number of   Cognitive     Number of   Complexity   Number of
            Items       Dimension     Items       Level        Items
Low         129         K&C           50          Simple       29
                                                  Complex      21
                        Application   52          Simple       33
                                                  Complex      19
                        ASE           27          Simple       13
                                                  Complex      14
Medium      113         K&C           38          Simple       25
                                                  Complex      13
                        Application   52          Simple       34
                                                  Complex      18
                        ASE           23          Simple       15
                                                  Complex      8
High        120         K&C           28          Simple       18
                                                  Complex      10
                        Application   57          Simple       32
                                                  Complex      25
                        ASE           35          Simple       17
                                                  Complex      18
Advanced    77          K&C           15          Simple       7
                                                  Complex      8
                        Application   33          Simple       20
                                                  Complex      13
                        ASE           29          Simple       12
                                                  Complex      17
Table 3. Counts of Reading items by test level, cognitive dimension, and item complexity

Level     N     Reading Purpose   N     Complexity   N
Low       127   Word Skills       53    Simple       46
                                        Complex      7
                Information       29    Simple       6
                                        Complex      23
                Literature        21    Simple       2
                                        Complex      19
                Tasks             24    Simple       2
                                        Complex      22
Medium    113   Word Skills       23    Simple       21
                                        Complex      2
                Information       38    Simple       4
                                        Complex      34
                Literature        35    Simple       --
                                        Complex      35
                Tasks             17    Simple       3
                                        Complex      14
High      126   Word Skills       --    Simple       --
                                        Complex      --
                Information       59    Simple       1
                                        Complex      58
                Literature        47    Simple       5
                                        Complex      42
                Tasks             20    Simple       2
                                        Complex      18
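As a minimal sketch of the two codings described above, the Python fragment below recodes linked b-values into the five difficulty categories of Table 1 and assigns the simple/complex flag for Math items. The file name and column names are hypothetical and purely illustrative; they are not part of the actual study, which may have used different tools entirely.

import pandas as pd

# Hypothetical item metadata; column names are illustrative, not from the study.
items = pd.read_csv("math_items.csv")  # assumed columns: b_value, has_figure,
                                       # has_table, needs_calculator

def difficulty_category(b):
    """Recode a linked 1PL b-value into the five broad categories of Table 1."""
    if b <= -1.5:
        return "Very Easy"
    elif b < -0.5:
        return "Easy"
    elif b < 0.5:
        return "Medium"
    elif b < 1.5:
        return "Difficult"
    return "Very Difficult"

items["difficulty_cat"] = items["b_value"].apply(difficulty_category)

# A Math item referencing any external element (figure or table in the stem or
# options, or a required calculator) is coded 'Complex'; otherwise 'Simple'.
items["complexity"] = (
    items[["has_figure", "has_table", "needs_calculator"]]
    .any(axis=1)
    .map({True: "Complex", False: "Simple"})
)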
Study 1

Method

The purpose of Study 1 is to evaluate the overall relationship between selected item-level factors (item difficulty, Math cognitive area/Reading purpose classification, and item complexity) and median response time in seconds. First, the median response time in seconds for each item was computed; the median is preferred to the mean in this research because some of the cells in the analysis consist of small numbers of items, and medians are less susceptible to outliers. Next, for each content area and test level combination, a three-factor ANOVA was carried out to evaluate the relationship between median response time and item difficulty, Math
cognitive dimension/Reading purpose, and item complexity, as well as their interactions. In Math, there were (up to) 5 difficulty categories by 3 cognitive dimensions by 2 levels of item complexity; in Reading, there were (up to) 5 difficulty categories by 4 reading purposes by 2 levels of item complexity. When a factor or interaction was significant at the 0.05 level, Tukey's HSD test was used to evaluate post hoc comparisons.
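A minimal sketch of this analysis is given below, using Python with pandas and statsmodels; the original analyses may well have been run in other software, and the file and column names here are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical response-level data for one content area/test level combination:
# one row per examinee-item interaction.
responses = pd.read_csv("math_low_responses.csv")  # assumed columns: item_id,
                                                   # rt_seconds, difficulty_cat,
                                                   # cognitive_dim, complexity

# Median response time in seconds for each item; the item-level factors are
# constant within an item, so they can be carried along in the groupby.
item_rt = (
    responses
    .groupby(["item_id", "difficulty_cat", "cognitive_dim", "complexity"])
    ["rt_seconds"].median()
    .reset_index(name="median_rt")
)

# Three-factor ANOVA on item-level median response time, including interactions.
model = smf.ols(
    "median_rt ~ C(difficulty_cat) * C(cognitive_dim) * C(complexity)",
    data=item_rt,
).fit()
print(anova_lm(model, typ=2))  # the sums-of-squares type here is an assumption

# Post hoc comparisons with Tukey's HSD for a factor found significant at the
# .05 level, e.g., cognitive dimension.
print(pairwise_tukeyhsd(item_rt["median_rt"], item_rt["cognitive_dim"], alpha=0.05))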
Results

A full set of descriptive statistics for median response times, item difficulty, cognitive areas/reading purposes, and item complexity (as well as all combinations thereof) for Math and Reading is located in Appendix A. Figure 3 below presents a line graph of the means of the median response times by item difficulty category for both content areas and all test levels.

Figure 3. Item difficulty by mean of median response time, Math and Reading, all test levels
As seen in Figure 3, the broad pattern observed in both Math and Reading indicates that as item difficulty increases, median response time increases as well. Some exceptions are present, such as the Medium test level in Math, where Medium difficulty items were associated with the highest median response times for this group, and the High test level in Math, where the most difficult items were not the most time consuming on average. Figure 4 presents the line graph of the mean of median response time by cognitive area for Math and by reading purpose for Reading.

Figure 4. Cognitive area/reading purpose by mean of median response time, Math and Reading, all test levels
[Line graph: mean of median response time in seconds (0-80) by Cognitive Area/Reading Purpose (K&C, App, ASE; Word Skills, Information, Literature, Tasks), with separate lines for Math Low, Medium, High, and Advanced and Reading Low, Medium, and High.]
The patterns for Math reflect the relatively hierarchical nature of the cognitive areas as defined in the test specifications for this assessment (as well as Bloom's taxonomy, on which those cognitive groupings were broadly based): for each test level, median response time clearly increases with the cognitive complexity of the items. The pattern is quite
different in Reading, and this is to be expected because the reading purposes are not similarly ordered. While the Word Skills items are generally more basic in nature and require the least time, the other three reading purposes were not intended to reflect a hierarchy of cognitive complexity and, not surprisingly, were similar with respect to median response time. The means of median response time by item complexity are given in Figure 5. For both Math and Reading, at all test levels complex items required more time than simple items.

Figure 5. Item complexity by mean of median response time, Math and Reading, all test levels
Table 4 provides the ANOVA results for Math (statistically significant results are shaded). The results in Table 4 indicate that there are several significant main effects on median response time, and these main effects differ across test levels. Cognitive dimension was significant at all Math test levels (Low, F(2,107)=5.659, p=.005; Medium, F(2,94)=9.575, p<.001; High, F(2,100)=3.126, p=.048; Advanced, F(2,57)=3.449, p=.039). Based on Cohen's (1988) guidelines for reporting effect sizes using partial eta squared (ηp²), the effect sizes for
Using Item Response Time Data 16 Low, Medium, and Advanced were large, while in the High test level, cognitive level had only a small effect. Next, for the Low and Medium test levels, the ANOVA indicates that the groups of items composed on the basis of item difficulty were significant (F(3,107)=4.961, p=.003, and F(4,94)=3.556, p=.010, respectively). In both cases those are medium effects with respect to partial eta squared (ηp2) results. However, difficulty was not significant for the High and Advanced test levels. While item complexity was not significant for the High or Advanced test levels, it was significant for Low and Medium test levels (F(1,107)=6.539, p=.012, and F(1,94)=5.336, p=.023, respectively), although these were small effects according to Cohen’s guidelines.
Table 4. Results of three-factor ANOVAs for median response time by test level in Math

Content    ANOVA Between    Sum of Squares    df    F    Signif. (p)    ηp²