Limiting the Number of Revisions While Providing Error-Flagging Support During Tests

Amruth N. Kumar
Ramapo College of New Jersey, Mahwah, NJ 07430, USA
[email protected]
Abstract. Error-flagging support provided during tests leads to higher scores, as reported in the literature. Although many beneficial factors contribute to the higher scores, one undesirable contributing factor is that students abuse error-flagging feedback to find the correct answer through trial and error, even when the test is not multiple-choice in nature. A limit can be placed on the number of revisions allowed per problem to foil this trial-and-error approach. A follow-up study was conducted to examine whether limiting the number of revisions allowed per problem yields the benefits of error-flagging feedback while alleviating its shortcomings. The study also considered the effect of error-flagging feedback on partial scores. The findings are: even with a limit placed on the number of revisions per problem, students revised more often and scored higher with error-flagging than without. When students solved problems incorrectly without revisions, their solutions qualified for more partial credit when error-flagging support was provided. When a limit was placed on the number of revisions and students solved problems correctly with revisions, they did so with fewer revisions when error-flagging feedback was provided than when it was not. When students solved problems incorrectly with revisions, even with a limit placed on the number of revisions, they revised more often with error-flagging than without and scored more partial credit, but did not take more time than when error-flagging was not provided. A limit on the number of revisions may discourage students from using error-flagging feedback as a substitute for their own judgment. Overall, students solved problems faster with error-flagging feedback, even though revisions prompted by such feedback can cost time.

Keywords: Error-flagging, Testing, Adaptation, Evaluation.
1 Introduction and Experiment

In a recent study of online tests that do not involve multiple-choice questions [1], students scored better on tests with error-flagging support than without. A follow-up study [2] found that when error-flagging feedback is provided, students save time on the problems that they already know how to solve, and spend additional time on the problems for which they do not readily know the correct solution. It also found that students may abuse error-flagging support to find the correct solution by trial and error. The work reported herein was conducted as a follow-up to study: 1) whether limiting the number of revisions allowed per problem would yield the benefits of error-flagging feedback while foiling its abuse; and 2) the effect of error-flagging feedback on partially correct solutions.

This work is of relevance to the tutoring systems community in that adaptive tutors often use an online pretest to prime the student model. Since error-flagging feedback helps students avoid inadvertent mistakes, tutors that provide error-flagging feedback during their pretest can build a more accurate student model, which facilitates better adaptation of tutoring content.

For the current study, two problem-solving software tutors were used in fall 2011. The tutors were on predicting the behavior of while and for loops in introductory computer programming. The while loop tutor targeted 9 concepts; the for loop tutor targeted 10 concepts. The tutors presented problems on these concepts, each problem containing a program whose output had to be identified by the student. Each software tutor administered a pretest-practice-post-test protocol in 30 minutes. Since this is a study of the effect of error-flagging feedback during testing, data from only the pretest portion of each tutor was considered for analysis.

The evaluations were conducted online and in vivo. The tutors were used in introductory programming courses at 11 institutions, which were randomly assigned to one of two groups: A or B. A partial cross-over design was used: students in group A served as test subjects on the while loop tutor and control subjects on the for loop tutor, while students in group B served as control subjects on the while loop tutor and test subjects on the for loop tutor. All else being equal, error-flagging feedback was provided during the pretest to students in the test group, but not to those in the control group. Error-flagging feedback was provided before the student submitted the answer.

When solving a problem, students identified the outputs of a program one at a time. Identifying each output consisted of entering the output string free-hand and selecting, from a drop-down menu, the line number of the code that generated the output. Students could go back and delete a previously entered output by clicking on the “Delete” button paired with it. When error-flagging feedback was provided, an answer was displayed on a red background if incorrect and a green background if correct. When error-flagging support was not provided, the answer was always displayed on a white background. Even when error-flagging support was provided, no facility was available for the student to find out why an output was incorrect or how it could be corrected. The online instructions presented to the students before using each tutor explained the significance of the background colors.

Whether or not the tutor provided error-flagging feedback, students had the option to revise their answer (e.g., using the “Delete” button described earlier) before submitting it. The interface always displayed the number of available revisions (maximum 3). Once the student used up all available revisions, the student could add additional outputs, but could no longer delete any previously identified outputs. These features were described in the instructions presented to the students at the beginning of each tutor.
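To make the answer-entry and revision mechanics concrete, the following is a minimal sketch of the interface logic just described. It is an illustration only, not the tutors' actual implementation: the class and method names are hypothetical, and only the behaviors stated above are modeled (free-hand output entries with line numbers, red/green flagging for the test group only, and a limit of three deletions per problem).

```python
# Minimal sketch of the pretest interface logic described above (hypothetical
# names; not the tutors' actual code). Models free-hand output entry, optional
# error-flagging (red/green vs. white background), and a limit of 3 revisions.

MAX_REVISIONS = 3

class PretestProblem:
    def __init__(self, expected_outputs, error_flagging):
        self.expected = expected_outputs      # ordered list of (line_no, text)
        self.error_flagging = error_flagging  # True only for the test group
        self.entries = []                     # outputs entered so far
        self.revisions_used = 0

    def add_output(self, line_no, text):
        """Record one output; return the background color to display it on."""
        position = len(self.entries)
        correct = (position < len(self.expected)
                   and self.expected[position] == (line_no, text))
        self.entries.append((line_no, text))
        if not self.error_flagging:
            return "white"                    # control group: no feedback
        return "green" if correct else "red"  # test group: error-flagging

    def delete_output(self, index):
        """Honor the 'Delete' button, up to MAX_REVISIONS times per problem."""
        if self.revisions_used >= MAX_REVISIONS:
            return False                      # deletions disabled after limit
        del self.entries[index]
        self.revisions_used += 1
        return True

    def revisions_left(self):
        """Displayed by the interface alongside the problem."""
        return MAX_REVISIONS - self.revisions_used


# Example: a student in the test group enters two outputs, one incorrect,
# and uses one of the three allowed revisions to delete it.
problem = PretestProblem([(3, "0"), (3, "1"), (5, "done")], error_flagging=True)
print(problem.add_output(3, "0"))   # green (correct)
print(problem.add_output(4, "1"))   # red (wrong line number)
print(problem.delete_output(1))     # True; 1 of 3 revisions used
print(problem.revisions_left())     # 2
```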
2 Results

For analysis, only those students were considered who had used both the while and for loop tutors, and who attempted most of the pretest problems: at least 6 of the 9 problems on the while loop tutor and 6 of the 10 problems on the for loop tutor. Students who scored 0 or 100% on either pretest were excluded. This left a total of 155 students: 126 students in group A and 29 students in group B. In order to factor out the effect of differences in the number of problems solved by the students, the average score per pretest problem (range 0 to 1.0) was considered for analysis rather than the total score.

Score Per Problem: A 2 x 2 mixed-factor ANOVA of the score per pretest problem was conducted with the treatment (without versus with error-flagging support) as the repeated measure and the group (group A with error-flagging on the while loop pretest versus group B with error-flagging on the for loop pretest) as the between-subjects factor. A significant main effect was found for error-flagging [F(1,153) = 77.662, p < 0.001]: students scored 0.541 ± 0.040 without error-flagging and 0.820 ± 0.024 with error-flagging (at the 95% confidence level). The difference was statistically significant [t(154) = -14.289, p < 0.001]. The effect size (Cohen's d) is 1.323, indicating a large effect: the test group mean is at the 90th percentile of the control group. So, even with a limit placed on the number of revisions per problem, students scored more with error-flagging support during tests than without.

A large, significant interaction was found between treatment and group [F(1,153) = 26.441, p < 0.001]. As shown in Table 1, the group with error-flagging scored statistically significantly more than the group without error-flagging on both the while loop pretest [t(153) = 3.414, p = 0.001] and the for loop pretest [t(153) = -6.050, p < 0.001]. Similarly, each group scored more with error-flagging than without: significantly so for group A [t(125) = 16.378, p < .001], and marginally so for group B [t(28) = -1.912, p = .066].

Table 1. Average pretest score per problem with and without error-flagging
                     Without error-flagging   With error-flagging
while loop pretest   0.704 ± 0.087            0.827 ± 0.027
for loop pretest     0.503 ± 0.043            0.789 ± 0.051
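The score analysis above can be reproduced with standard statistical tooling. The sketch below is illustrative only: it is not the study's analysis script, and the column names subject, group, treatment, and score are hypothetical. It shows how the 2 x 2 mixed-factor ANOVA, the paired t-test, and Cohen's d for the treatment contrast might be computed from a long-format table of per-student average scores.

```python
# Illustrative analysis sketch; not the study's actual scripts. Assumes a
# long-format DataFrame with hypothetical columns:
#   subject, group ('A'/'B'), treatment ('with'/'without'), score
import numpy as np
import pandas as pd
import pingouin as pg          # provides a mixed-design ANOVA
from scipy import stats

def analyze_scores(df: pd.DataFrame) -> None:
    # 2 x 2 mixed-factor ANOVA: treatment as the repeated measure,
    # group as the between-subjects factor.
    aov = pg.mixed_anova(data=df, dv="score", within="treatment",
                         subject="subject", between="group")
    print(aov)

    # Paired t-test on each student's score with vs. without error-flagging.
    wide = df.pivot_table(index="subject", columns="treatment", values="score")
    t, p = stats.ttest_rel(wide["without"], wide["with"])
    print(f"t({len(wide) - 1}) = {t:.3f}, p = {p:.4g}")

    # Cohen's d: difference of condition means over the pooled standard
    # deviation of the two conditions.
    diff = wide["with"].mean() - wide["without"].mean()
    pooled_sd = np.sqrt((wide["with"].var(ddof=1)
                         + wide["without"].var(ddof=1)) / 2)
    print(f"Cohen's d = {diff / pooled_sd:.3f}")
```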
Time Per Problem: A 2 x 2 mixed-factor ANOVA of the time per pretest problem was conducted with the treatment as the repeated measure and the group as the between-subjects factor. A significant main effect was found for error-flagging [F(1,153) = 6.581, p = 0.011]: students spent 122.412 ± 7.455 seconds per problem without error-flagging and 95.609 ± 6.150 seconds with error-flagging support. The difference was statistically significant [t(154) = 6.582, p < 0.001]. The effect size (Cohen's d) is 0.617, indicating a medium effect: the test group mean is at the 73rd percentile of the control group. So, overall, students solved problems faster with error-flagging feedback, even though revisions prompted by such feedback can cost time.

A large, significant interaction was observed between treatment and group [F(1,153) = 21.456, p < 0.001]. As shown in Table 2, the group with error-flagging solved problems faster than the group without error-flagging, but the difference was not statistically significant on either pretest. The difference with versus without error-flagging was significant for group A [t(125) = 8.826, p < .001], but not for group B.

Table 2. Average pretest time per problem (seconds) with and without error-flagging
                     Without error-flagging   With error-flagging
while loop pretest   102.913 ± 13.525         91.594 ± 5.960
for loop pretest     126.900 ± 8.455          113.051 ± 19.269
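The percentile interpretations attached to the effect sizes above follow from treating the control-group distribution as approximately normal: a test-group mean that lies d control-group standard deviations above the control mean sits at the percentile given by the standard normal CDF,

    percentile ≈ Φ(d) × 100%.

For example, Φ(0.617) ≈ 0.73 yields the 73rd percentile quoted for the time-per-problem result, and Φ(1.323) ≈ 0.91 yields the roughly 90th percentile quoted for the score result.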
Number of Revisions: A 2 x 2 mixed-factor ANOVA of the number of revisions was conducted with the treatment as the repeated measure and the group as the between-subjects factor. A significant main effect was found for error-flagging [F(1,153) = 50.711, p < 0.001]: students revised an average of 1.26 ± 0.232 times without error-flagging and 3.90 ± 0.623 times with error-flagging support. The difference was statistically significant [t(154) = -7.988, p < 0.001]. The effect size (Cohen's d) is -0.885, indicating a large effect: the test group mean is at the 82nd percentile of the control group. So, even with a limit placed on the number of revisions per problem, students revised their answers more often with error-flagging support than without.

Both groups revised more often with error-flagging than without, as shown in Table 3. The difference with versus without error-flagging was significant for group A [t(125) = -6.354, p < .001] as well as group B [t(28) = -6.011, p < .001].

Table 3. Number of revisions with and without error-flagging
                     Without error-flagging   With error-flagging
while loop pretest   1.17 ± .584              3.70 ± .721
for loop pretest     1.29 ± .253              4.76 ± 1.086
As in the previous study [2], we considered four cases for comparing students with and without error-flagging support:

1. Students solved a problem correctly without any revisions – we compared the time students took to solve each problem.
2. Students solved a problem incorrectly without any revisions – we compared the partial score and the time spent per problem.
3. Students solved a problem correctly with revisions – we compared the number of revisions and the time spent per problem.
4. Students solved a problem incorrectly with revisions – we compared the partial score, the time spent per problem, and the number of revisions.

The limit placed on the number of revisions per problem is expected to affect cases 3 and 4 only (a sketch of this partition appears after Case 1 below).

Case 1 – Problem solved correctly without any revisions: Univariate analysis of variance of the time spent per problem yielded a significant main effect for treatment [F(1,1135) = 33.462, p < .001]: students spent 91.56 ± 6.99 seconds per problem without and 67.9 ± 4.57 seconds with error-flagging support. This confirms the earlier result: when error-flagging support is provided, students save the time they would have spent re-checking their solution.
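The following sketch shows how problem-level records might be partitioned into the four cases above and compared across treatments. The column names (correct, revisions, time, treatment) are hypothetical, and the sketch is not the study's actual analysis code.

```python
# Illustrative partition of problem-level records into the four cases above.
# Hypothetical columns: correct (bool), revisions (int), time (seconds),
# treatment ('with'/'without').
import pandas as pd
from scipy import stats

def compare_cases(df: pd.DataFrame) -> None:
    cases = {
        "1: correct, no revisions":   df.correct & (df.revisions == 0),
        "2: incorrect, no revisions": ~df.correct & (df.revisions == 0),
        "3: correct, revised":        df.correct & (df.revisions > 0),
        "4: incorrect, revised":      ~df.correct & (df.revisions > 0),
    }
    for label, mask in cases.items():
        sub = df[mask]
        with_ef = sub.loc[sub.treatment == "with", "time"]
        without_ef = sub.loc[sub.treatment == "without", "time"]
        # Univariate (one-way) ANOVA on time per problem; with two groups
        # this is equivalent to an independent-samples t-test.
        f, p = stats.f_oneway(with_ef, without_ef)
        print(f"Case {label}: F(1,{len(sub) - 2}) = {f:.3f}, p = {p:.3f}")
```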
Case 2 – Problem solved partially or incorrectly without any revisions: An ANOVA of the time spent per problem yielded a significant main effect for treatment [F(1,1146) = 7.178, p = .007]: students solved the problems in 136.48 ± 7.108 seconds per problem without and 117.42 ± 12.726 seconds with error-flagging support. An ANOVA of the partial score yielded a significant main effect for treatment [F(1,1146) = 183.288, p < .001]: students scored 0.209 ± .021 points per problem without error-flagging, and 0.495 ± .037 points per problem with error-flagging support. So, even when students solved problems incorrectly without revisions, their solution qualified for more partial credit when error-flagging support was provided. In this study, they also solved the problems faster than when error-flagging was not provided.

Case 3 – Problem solved correctly, with revisions: An ANOVA of the time spent per problem yielded no significant main effect for treatment [F(1,290) = 0.166, p = 0.684]: whereas students solved problems correctly in an average of 92.91 ± 13.01 seconds without error-flagging and 97.74 ± 11.00 seconds with error-flagging, the difference was not statistically significant. An analysis of the number of revisions yielded a significant main effect for treatment [F(1,290) = 20.44, p < .001]: students revised their answers 1.49 ± .178 times without error-flagging, and 1.16 ± .056 times with error-flagging. So, when a limit was placed on the number of revisions, students solved problems correctly with fewer revisions when error-flagging support was provided than when it was not. We speculate that when students are made aware of the limit placed on the number of revisions allowed, they deliberate more before revising and therefore need fewer revisions. Fewer revisions may also help explain why, overall, students spent less time with error-flagging feedback than without. Revisions still carry a time penalty: among the problems that students with error-flagging support solve correctly, the problems solved without revisions take significantly less time (67.9 ± 4.575 seconds) than the problems solved with revisions (97.74 ± 11.0 seconds) [t(891) = 5.794, p < .001].

Case 4 – Problem solved partially or incorrectly with revisions: An ANOVA of the time spent per problem yielded no significant main effect for treatment [F(1,265) = .024, p = 0.876]: students spent about the same amount of time without (145.79 ± 19.62 seconds) as with error-flagging support (147.7 ± 13.07 seconds). An ANOVA of the number of revisions yielded a significant main effect for treatment [F(1,265) = 8.411, p = 0.004]: students revised 1.42 ± .155 times without error-flagging and 1.73 ± .12 times with error-flagging. An ANOVA of the partial credit earned by students yielded a significant main effect for treatment [F(1,265) = 27.82, p < .001]: students scored .221 ± .067 points without error-flagging and .435 ± .043 points with error-flagging. So, even when a limit is placed on the number of revisions, students revise more often with error-flagging than without and score more partial credit, but do not take more time than when error-flagging is not provided.

Table 4 lists the percentage of problems that were solved correctly or incorrectly, with and without revisions, in the two treatments.
The prior study had reported that students with error-flagging feedback solved a third fewer problems correctly without revisions than with revisions, presumably because students were using error-flagging feedback as a substitute for their own judgment. With the introduction of a limit on the number of allowed revisions, students with error-flagging feedback solved nearly three times as many problems correctly without revisions as with revisions. This reversal suggests that a limit on the number of revisions may discourage students from using error-flagging feedback as a substitute for their own judgment. As in the prior study, we note that the percentage of problems solved incorrectly without any revisions is far smaller with error-flagging than without. In other words, students take advantage of error-flagging feedback to fix an incorrect answer. It is clear that students with error-flagging support revise their solutions far more than those without error-flagging support, whether or not the solution eventually turns out to be correct. The objective of limiting the number of revisions allowed per problem is to minimize the amount of time students spend revising solutions that eventually turn out to be incorrect, and/or to increase the partial credit students score in such cases. Case 4 above bears out that this objective was met.

Table 4. Percentage of problems solved correctly/incorrectly, with and without revisions
                        Without Error-Flagging        With Error-Flagging
                        Correct   Partial/Incorrect   Correct   Partial/Incorrect
Solution never revised  33.08     57.63               47.14     22.60
Solution revised         3.95      5.34               16.74     13.52
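The comparisons drawn from Table 4 can be checked directly against its entries: with error-flagging, problems solved correctly without revisions outnumber those solved correctly with revisions by a factor of 47.14 / 16.74 ≈ 2.8 (the "nearly three times" noted above), and problems solved incorrectly without revisions drop from 57.63% without error-flagging to 22.60% with it.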
In conclusion, placing a limit on the number of revisions per problem did yield the benefits of error-flagging feedback while foiling its abuse. Even with the limit, students revised more often and scored higher with error-flagging than without. When students solved problems incorrectly without revisions, their solutions qualified for more partial credit when error-flagging support was provided. With the limit in place, when students solved problems correctly with revisions, they did so with fewer revisions when error-flagging feedback was provided than when it was not. When students solved problems incorrectly with revisions, even with the limit in place, they revised more often with error-flagging than without and scored more partial credit, but did not take more time than when error-flagging was not provided. A limit on the number of revisions discourages students from relying on error-flagging feedback uncritically. Overall, students solved problems faster with error-flagging feedback, even though revisions prompted by such feedback can cost time. This makes the process of using a pretest to prime the student model in an adaptive tutor more efficient.

Acknowledgments. Partial support for this work was provided by the National Science Foundation under grant DUE-0817187.
References

1. Kumar, A.N.: Error-Flagging Support for Testing and its Effect on Adaptation. In: Proc. Intelligent Tutoring Systems (ITS 2010), LNCS 6094, pp. 359-368 (2010)
2. Kumar, A.N.: Error-Flagging Support and Higher Test Scores. In: Proc. Artificial Intelligence in Education (AI-ED 2011), LNAI 6738, pp. 147-154 (2011)