Pardos, Z. A., Dailey, M. N., Heffernan, N. T. (submitted) Learning what works in ITS from non-traditional randomized controlled trial data. In Proceedings of the 10th International Conference on Intelligent Tutoring Systems. Pittsburgh, PA. Springer-Verlag: Berlin.

Learning what works in ITS from non-traditional randomized controlled trial data

Zachary A. Pardos, Matthew N. Dailey, Neil T. Heffernan
Worcester Polytechnic Institute
{zpardos, mdailey, nth}@wpi.edu

Abstract. The traditional, well-established approach to finding out what works in education research is to run a randomized controlled trial (RCT) using a standard pretest and posttest design. RCTs have been used in the intelligent tutoring community for decades to determine which questions and tutorial feedback work best. Practically speaking, however, ITS creators need to make decisions on what content to deploy without the benefit of having run an RCT in advance. Additionally, most log data produced by an ITS is not in a form that can easily be evaluated with traditional methods. As a result, there is much data produced by tutoring systems that we would like to learn from but currently do not. In prior work we introduced a potential solution to this problem: a Bayesian networks method that could analyze the log data of a tutoring system to determine which items were most effective for learning among a set of items of the same skill. The method was validated by way of simulations. In this work we further evaluate the method by applying it to real world data from 11 experiment datasets that investigate the effectiveness of various forms of tutorial help in a web-based math tutoring system. The goal of the method was to determine which questions and tutorial strategies cause the most learning. We compared these results with a more traditional hypothesis testing analysis, adapted to our particular datasets. We analyzed experiments in mastery learning problem sets as well as experiments in problem sets that, even though they were not planned RCTs, took on the standard RCT form. We found that the tutorial help or item chosen by the Bayesian method as having the highest rate of learning agreed with the traditional analysis in 9 of the 11 experiments. The traditional analysis, however, agreed with the choices of a group of subject matter experts more often than did the Bayesian method. The practical impact of this work is an abundance of knowledge about what works that can now be learned from the thousands of experimental designs intrinsic in datasets of tutoring systems that assign items in a random order.

Keywords: Knowledge Tracing, Item Effect Model, Bayesian Networks, Randomized Controlled Trials, Data Mining

1 Introduction

The traditional, well-established approach to finding out what works in an intelligent tutoring system is to run a randomized controlled trial (RCT) using a standard pretest and posttest design. RCTs have been used in the intelligent tutoring systems (ITS) community for decades to determine best practices for a particular context. Practically speaking, however, ITS creators need to make decisions on what content to deploy without the benefit of having run an RCT in advance. Additionally, most log data produced by an ITS is not in a form that can easily be evaluated with traditional hypothesis testing such as learning gain analysis with t-tests and ANOVAs. As a result, there is much data produced by tutoring systems that we would like to learn from but currently do not. In prior work [1] we introduced a potential solution to this problem: a Bayesian networks method that could analyze the log data of a tutoring system to determine which items were most effective for learning among a set of items of the same skill. The method was validated by way of simulations with promising results but had been used with few real world datasets. In this work we further evaluate the method by applying it to real world data from 11 experiment datasets that investigate the effectiveness of various forms of tutorial help in a web-based math tutoring system. The goal of the method was to determine which questions and tutorial strategies cause the most learning. We compare these results with results from a more traditional hypothesis testing analysis, adapted to our particular datasets.

1.1 The ASSISTment System – a web-based tutoring system

Our datasets consisted of student responses from The ASSISTment System, a web-based math tutoring system for 7th-12th grade students that provides preparation for the state standardized test by using released math items from previous tests as questions on the system. Figure 1 shows an example of a math item on the system and the tutorial help that is given if the student answers the question wrong or asks for help. The tutorial help assists the student in learning the required knowledge by breaking each problem into sub-questions called scaffolding or by giving the student hints on how to solve the question. A question is only marked as correct if the student answers it correctly on the first attempt without requesting help.

Figure 1. An example of an ASSISTment item where the student answers incorrectly and is given tutorial help. (The figure shows the original question followed by the first scaffolding question, a hint, and a buggy message.)

1.2 Item templates in The ASSISTment System

Our mastery learning data consists of responses to multiple questions generated from an item template. A template is a skeleton of a problem created by a content developer in our web-based builder application. For example, the template would specify a Pythagorean Theorem problem, but without the numbers for the problem filled in. In this example the problem template could be: “What is the hypotenuse of a right triangle with sides of length X and Y?” where X and Y are variables that will be filled in with values when questions are created from the template. The solution is also dynamically determined from a solution template specified by the content developer. In this example the solution template would be “Solution = sqrt(X^2+Y^2)”. Ranges of values for the variables can be specified, and more advanced template features are available to the developer such as dynamic graphs, tables and even randomly selected cover stories for word problems. Templates are also used to construct the tutorial help of the template items. Items created from these templates are used extensively in the mastery learning problem sets as a pragmatic way to provide a high volume of items for students to practice particular skills on.
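To make the template mechanism concrete, the following is a minimal Python sketch of how an item and its solution might be instantiated from a Pythagorean Theorem template. The field names and data layout are hypothetical; the actual ASSISTment builder stores and renders templates differently.

```python
import math
import random

# A hypothetical, simplified Pythagorean Theorem template: the question text
# contains variables X and Y, and the solution is computed from the same values.
template = {
    "question": "What is the hypotenuse of a right triangle with sides of length {X} and {Y}?",
    "solution": lambda X, Y: math.sqrt(X**2 + Y**2),
    "ranges": {"X": (3, 12), "Y": (3, 12)},
}

def instantiate(template, rng=random):
    """Fill in the template variables with random values from their ranges."""
    values = {var: rng.randint(lo, hi) for var, (lo, hi) in template["ranges"].items()}
    question = template["question"].format(**values)
    answer = template["solution"](**values)
    return question, answer

question, answer = instantiate(template)
print(question)           # e.g. "What is the hypotenuse ... length 5 and 12?"
print(round(answer, 2))   # e.g. 13.0
```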

2 The Item Effect Model

We use a Bayesian networks method called the Item Effect Model [1] for our analysis. This model is based on knowledge tracing [3], which has been the leading modeling technique used in tracking student knowledge in intelligent tutoring systems for over a decade. Knowledge tracing assumes that the rate of learning for each item, or piece of learning content, is the same. However, the fact that certain content has been shown to be more effective than other content is reason enough to question this assumption. Knowledge tracing is a special case of the Item Effect Model, which allows different items to cause different amounts of learning, including the same amount. This is scientifically interesting and practically useful in helping ITS designers and investigators better learn what type of tutoring can maximize student learning.

Figure 2. An example two-sequence topology of the Item Effect Model with two item types and descriptions of the model's knowledge and performance parameters.

Model parameters:
P(L0) = probability of initial knowledge
P(TA) = probability of learning from item A
P(TB) = probability of learning from item B
P(GA), P(GB) = probability of guess on items A and B
P(SA), P(SB) = probability of slip on items A and B

Nodes and states:
K = knowledge node (hidden variable; two states: learned, not learned)
{A, B} = item nodes (observed variables; two states: correct, incorrect)


The Item Effect Model, depicted in Figure 2, allows a learning rate as well as a guess and slip parameter to be learned per item. The guess rate is the probability that a student will answer an item correctly even if they do not know the skill involved. The slip rate is the probability that a student will answer an item incorrectly even if they know the skill involved. The Item Effect Model looks for when a student is believed to have transitioned from the unlearned state to the learned state, which is generally indicated by incorrect answers followed by correct answers to a series of items. The model credits the last item answered before the transition with causing this learning. If the model observes a pattern of learning that frequently occurs after a particular item, that item is attributed a higher learn rate than the other items in the problem set being considered. The probabilities of learning associated with the items are relative to the other items in the problem set and are indicative of a particular item's ability to cause positive performance on the other items.

Because of the fashion in which the Item Effect Model looks for patterns of learning, it requires that the items in a problem set be given to students in a random order. The Item Effect Model in fact models every permutation of a set of items. This means that when analyzing a 3-item problem set, all 6 permutations of sequences are modeled. Randomization is required much in the same way that an RCT requires randomization in the assignment of conditions. Neither hypothesis testing nor the Item Effect Model can identify learning effects from a single linear sequence.

In order to determine the statistical reliability of differences in learn rates, the data is randomly split into 10 bins of equal size. The method is run on each of the 10 bins, and the probability of one item having a higher learning rate than another in N or more bins by chance is determined by the binomial function: 1 - binocdf(N-1, 10, 0.50). In this paper, N ≥ 8 will be considered statistically significant.
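To make the mechanics concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes hypothetical, already-fitted parameter values (in practice these are estimated from the data, for example with EM over all item permutations), applies a knowledge-tracing-style forward update in which the learning transition probability depends on which item was just answered, and computes the binomial reliability test described above.

```python
from scipy.stats import binom

# Hypothetical parameters for a two-item problem set; in the real model
# these are fit to student data rather than set by hand.
P_L0 = 0.30                       # probability of initial knowledge
learn = {"A": 0.20, "B": 0.08}    # per-item learn rates P(TA), P(TB)
guess = {"A": 0.15, "B": 0.15}    # per-item guess rates
slip  = {"A": 0.10, "B": 0.10}    # per-item slip rates

def forward_knowledge(item_seq, response_seq):
    """Knowledge-tracing-style forward filter where the learning transition
    probability depends on which item was just answered."""
    p_known = P_L0
    for item, correct in zip(item_seq, response_seq):
        # Condition the knowledge estimate on the observed response.
        if correct:
            num = p_known * (1 - slip[item])
            den = num + (1 - p_known) * guess[item]
        else:
            num = p_known * slip[item]
            den = num + (1 - p_known) * (1 - guess[item])
        p_known = num / den
        # Apply the learning transition attributed to this item.
        p_known = p_known + (1 - p_known) * learn[item]
    return p_known

print(forward_knowledge(["A", "B"], [0, 1]))   # posterior probability of knowledge

def binomial_reliability(n_wins, n_bins=10):
    """Probability of one item beating another in n_wins or more of n_bins
    random splits by chance alone: 1 - binocdf(N-1, 10, 0.50)."""
    return 1 - binom.cdf(n_wins - 1, n_bins, 0.5)

print(binomial_reliability(8))   # about 0.055, the N >= 8 threshold used here
```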

3 Analyzing Experiments in Mastery Learning Content

Mastery learning [2] is a method of instruction whereby students are only able to progress to the next topic after they have demonstrated mastery of the current topic. In the Cognitive Tutor [4], mastery is reached when the knowledge tracing model believes the student knows the topic with probability 0.95 or better. Items are often selected in a random order and students spend a lot of time in this type of instruction, making it especially lucrative to mine. If a student gets an item wrong or requests help, she is given tutorial feedback aimed at improving understanding of the current topic. This is a currently untapped opportunity for investigators to test different types of tutorial feedback. By authoring multiple types of tutorial feedback that the system can choose from at random when a student begins an item, investigators can plan experiments to test those feedback types and embed the experiments within mastery problem sets. After data is gathered, the student response sequences can be analyzed and the learning rates of each strategy calculated using the Item Effect Model. A statistical significance test is employed with the method to tell investigators the probability that their result occurred by chance. We ran 5 such tutorial feedback experiments embedded in mastery content. The following sections will show how this data was analyzed with the model and the conclusions made.
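As a sketch of how such an experiment can be embedded, the hypothetical Python snippet below serves mastery items while randomly assigning one of two tutorial feedback conditions per item and logging first-attempt correctness for later analysis. For brevity it uses three consecutive correct responses as a stand-in for the actual mastery criterion (the knowledge tracing estimate reaching 0.95, as sketched in Section 2).

```python
import random

def mastered(responses, streak=3):
    """Stand-in mastery criterion: three consecutive correct first attempts.
    (The Cognitive Tutor instead requires the knowledge tracing estimate of
    knowledge to reach 0.95.)"""
    return len(responses) >= streak and all(responses[-streak:])

def run_embedded_experiment(get_response, conditions=("A", "B"), max_items=30):
    """Serve mastery items, randomly assigning one tutorial feedback
    condition per item, and log (condition, correct) pairs for later analysis."""
    log = []
    while len(log) < max_items and not mastered([c for _, c in log]):
        condition = random.choice(conditions)   # randomized feedback condition
        correct = get_response(condition)       # 1 if the first attempt was correct
        log.append((condition, correct))
    return log

# Example: a simulated student who answers correctly 60% of the time.
print(run_embedded_experiment(lambda condition: int(random.random() < 0.6)))
```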


3.1 The tutorial feedback experiments

We planned five experiments to investigate the effectiveness of various types of tutorial feedback. The choices of feedback types were selected based on past studies [5, 6, 7] of effective tutor feedback and interventions that have been run on The ASSISTment System. To create the experiments we took existing mastery learning problem sets from various math subjects and created two types of feedback conditions for each item in the problem set. The two types of feedback corresponded to the conditions we had planned for that experiment. This authoring process was made less tedious by utilizing the template feature described in Section 1.2.

Table 1. The five planned mastery tutoring experiments, their subject matter, and the two types of tutor feedback being tested.

Experiment #   Condition A        Condition B       Subject Matter
1              Solution (steps)   TPS               Ordering fractions
2              Solution           Worked Example    Finding percents
3              Hints              TPS               Equation solving (Easy)
4              Solution (steps)   Solution          Equation solving (Medium)
5              Solution (steps)   TPS               Equation solving (Hard)

There were five types of tutorial feedback tested. This is a description of each:

─ TPS: Tutored Problem Solving [5, 6, 7]. This is a scaffolding type of feedback where students are asked to solve a number of short problems that help the student solve the original, harder problem.
─ Worked Example: In this condition [7] students are shown a complete solution to a problem similar to the one they were originally asked to solve.
─ Solution: In this condition [5] students are shown a complete solution to the exact question they were originally asked to solve.
─ Solution (steps): In this condition [5] students were shown a complete solution to the problem they were originally asked to solve, but broken up into steps. The student needed to click a check box confirming she or he had read the current solution step in order to move on.
─ Hints: In this condition [6] students were given text-based hints on how to solve the problem. Students had the opportunity to attempt to answer the original question again at any time. If the student asked for additional hints, the hints would become more specific about exactly how to solve the problem. The last hint would tell them the correct answer to the original problem.

3.2 Modeling mastery learning content containing multiple types of tutoring

We adapted the Item Effect Model to suit our needs by making small changes to the assumption of what an item represents. In the standard Item Effect Model, an item directly represents the question and the tutorial feedback associated with that question. Since we were concerned with the effectiveness of multiple types of tutorial feedback for the same items, we instead let the tutorial strategy selected for the student represent the item.


For example, suppose we had a mastery learning problem set that used items from two templates, 1 and 2, and two types of tutor feedback, A and B, that were available in both templates. We might observe student responses like the ones in Table 2.

Table 2. Example of mastery learning data from two students

Student 1 response sequence   0     0     1
Student 1 item sequence       1.A   2.B   1.B

Student 2 response sequence   0     1     1
Student 2 item sequence       1.B   1.A   1.B

Table 2 shows the two students' response patterns and the corresponding item and tutorial feedback assigned (A or B). Student 1 answers the first two items incorrectly (0) but the last one correctly (1). Student 2 answers the first item incorrectly but the remaining two correctly. If we assume both students learned when they started to answer items correctly, then we can look at which tutorial strategy directly preceded the correct answers and credit that tutorial strategy with the learning. In the example data above, tutorial feedback B precedes both students' learning. In essence, this is what our Bayesian method does to determine learning rates of types of tutorial feedback, albeit in a more elegant fashion than the observational method just described. For analyzing the learning effect of different types of tutor feedback, we assume all items in a problem set have the same guess and slip rate, since the guess and slip of an item is independent of its tutorial feedback. For simplicity, and to keep the amount of data the Bayesian method used close to that of the traditional method, only students' first three responses in the problem sets are used.
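The following small Python sketch illustrates the observational heuristic just described (not the Bayesian method itself): using the data from Table 2, it credits the feedback condition of the item answered immediately before each student's first correct response.

```python
from collections import Counter

# Each student's data: parallel lists of responses (0/1) and the feedback
# condition ("A" or "B") attached to the item they saw, as in Table 2.
students = [
    {"responses": [0, 0, 1], "conditions": ["A", "B", "B"]},   # Student 1
    {"responses": [0, 1, 1], "conditions": ["B", "A", "B"]},   # Student 2
]

def credit_preceding_condition(student):
    """Return the condition of the item answered just before the first
    correct response, i.e. the last feedback the student received."""
    responses, conditions = student["responses"], student["conditions"]
    for i, correct in enumerate(responses):
        if correct and i > 0:
            return conditions[i - 1]
    return None   # no correct answer observed after receiving feedback

credits = Counter(credit_preceding_condition(s) for s in students)
print(credits)   # Counter({'B': 2}) -- condition B precedes both students' learning
```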

3.3 Traditional hypothesis test used

In order to compare the Bayesian analysis to a more traditional learning gain approach, we devised the following method to determine which tutorial feedback produced more learning. First we selected only those students who answered their first question incorrectly, because these are the only students who would have seen the tutoring on the first question. These students were then split into two groups determined by the type of feedback (condition A or B) they received on that question. We let the second question serve as a post-test and took the difference between their first response and their second response to be their learning gain. Since only incorrect responses to the first question were selected, the gain will be either 0 or 1. To determine if the difference in learning gains was significant, we ran a t-test on the student learning gains of each condition. We do not claim this to be the most powerful statistical analysis that can be achieved, but we do believe that it is sound, in that a claim of statistical significance using this method can be believed.
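A minimal sketch of this comparison, with hypothetical gain arrays: students who missed the first question are split by the condition they received on it, their gain on the second question is 0 or 1, and an independent-samples t-test compares the two groups.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical gains (0 or 1) on the second question for students who
# answered the first question incorrectly, split by feedback condition.
gains_A = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])
gains_B = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

t_stat, p_value = ttest_ind(gains_A, gains_B)
print(f"mean gain A = {gains_A.mean():.2f}, mean gain B = {gains_B.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```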


3.4 Results

The Item Effect Model inferred learning rates for each condition of each experiment. The condition (A or B) with the higher learning rate for each experiment was chosen as best. For the traditional analysis, the condition with the higher average gain score was chosen as best. Table 3 shows the results of the analysis, including the number of students in each experiment and whether the two methods agreed on their choice of best condition for the experiment.

Table 3. Analysis of the five tutoring experiments by the traditional hypothesis testing method and the Item Effect Model. The methods agreed on the best condition in 5 of the 6 experiments.

#    Users   Traditional Best   Traditional Sig.   Agree?   IEM Best   IEM Sig.   IEM Guess   IEM Slip
1    155     B                  0.48               Yes      B          0.17       0.28        0.07
2    302     A                  0.52               Yes      A          0.17       0.11        0.14
3    458     B                  0.44               No       A          0.62       0.18        0.09
4    278     A                  0.57               Yes      A          0.38       0.15        0.17
5    141     B                  0.63               Yes      B          0.62       0.15        0.13
5*   138     B                  0.69               Yes      B          0.05       0.14        0.17

The traditional analysis did not find a significant difference between the conditions in any of the experiments, while the Item Effect Model found a significant difference in one, 5*. The reliably better condition for this experiment was B, the tutored problem solving condition. This was also the experiment in which students who saw condition A were given a bad solution due to a typo. These results show that the Item Effect Model successfully detected this effect and found it to be statistically reliable while the traditional method did not. Experiment 5 represents the data from students after the typo was fixed. Condition B is still best but no longer significantly so. Also included in the table are the guess and slip rates learned by the Item Effect Model for each experiment. It is noteworthy that the highest guess rate learned, 0.28, was for the only experiment whose items were multiple-choice (four answer choices). It is also noteworthy that the guess values for the easy, medium and hard equation solving experiments (3-5 respectively) decrease with the increasing difficulty of the problem set. These observations are evidence of a model that learns highly interpretable parameter values, an elusive but essential trait when the interpretation of parameters is informing pedagogical insights.

4 Analyzing RCT designs with feedback on all items

We wanted to investigate the performance of our model on data that took the form of a more traditional RCT. Instead of running five new randomized controlled trials, we searched our log data for regular problem sets that were not intended to be RCTs but that satisfied a pre-test/condition/post-test experimental design. For this analysis, what mattered was not which particular tutor feedback was being tested but rather the model's ability to find significance compared to hypothesis testing methods that were designed to analyze data similar to this.

4.1 Looking for RCT designs in the log data of randomized problem sets

The data used in this analysis also came from The ASSISTment System. We looked for random item order problem sets of size four in which all items related to a similar skill. We also required that a student complete all the items in the problem set on a single day. Once we identified such problem sets, we identified pairs of item sequences in which the first and third questions presented to the student were fixed. The remaining two questions would serve as the two conditions. We examined all such pairs of sequences in which each sequence had been completed by at least fifty students. The five problem sets with the most data were used as the experiments.
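The following Python sketch, using a hypothetical log layout, illustrates the search described above: group students by the order in which the four items were presented, and keep pairs of orderings that share the same first and third items (the pre-test and post-test), each completed by at least fifty students.

```python
from collections import Counter
from itertools import combinations

# Hypothetical log: one tuple per student giving the order in which the four
# items of a randomized problem set were presented (all completed in one day).
student_orders = (
    [("q1", "q2", "q3", "q4")] * 60 +
    [("q1", "q4", "q3", "q2")] * 55 +
    [("q3", "q2", "q1", "q4")] * 40
)

MIN_STUDENTS = 50

def find_rct_pairs(orders):
    """Return pairs of orderings with the same 1st and 3rd items (pre/post-test);
    the items appearing 2nd act as the two conditions."""
    counts = Counter(orders)
    frequent = [o for o, n in counts.items() if n >= MIN_STUDENTS]
    pairs = []
    for a, b in combinations(frequent, 2):
        if a[0] == b[0] and a[2] == b[2] and a[1] != b[1]:
            pairs.append((a, b))
    return pairs

print(find_rct_pairs(student_orders))
# [(('q1', 'q2', 'q3', 'q4'), ('q1', 'q4', 'q3', 'q2'))]
```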

4.2 The inter-rater plausibility study

We wanted to provide a human point of reference against which to compare the results. To do this we selected four educators in the field of mathematics as raters. They were told which two items were used as the pre-test and post-test and which were used as the conditions. They were also able to inspect the feedback of all the items and then judge which of the two conditions was more likely to show learning gains.

4.3 Modeling learning in problem sets with RCT data sequences

One difference between the modeling of these datasets and the mastery learning datasets is the reduction in sequences and the decision to let each item have its own guess and slip value. Observing only two sequence permutations is not the ideal circumstance for the Item Effect Model but represents a very common design structure of experiment data that will serve as a relevant benchmark.

4.4 Statistical test

Since this particular data more closely resembled an RCT, we were able to use a more familiar learning gain analysis as the traditional hypothesis testing method of comparison. Learning gain was calculated by taking the post-test minus the pre-test for each student in their respective condition. To determine whether the learning gains of the two conditions were statistically significantly different, a t-test was used.

4.5 Results

Table 4 shows the results of the two analysis methods as well as the best condition picks of the four raters in the subject matter expert survey. For each experiment the condition groups were found to be balanced at pre-test. There was one experiment, #4, in which both methods agreed on the best condition and reported a statistically significant difference between the conditions.


Table 4. Analysis of the five RCT-style experiments by the traditional hypothesis testing method and the Item Effect Model. The methods agreed on the best condition in 4 of the 5 experiments and agreed on statistical significance in one experiment.

#   Users   Traditional Best   Traditional Sig.   Agree?   IEM Best   IEM Sig.   Rater 1   Rater 2   Rater 3   Rater 4
1   149     A                  0.74               Yes      A          0.95       A         A         A         A
2   197     A                  0.36               Yes      A          0.62       A         A         A         A
3   312     A                  0.04               No       B          0.38       A         A         A         B
4   208     A                  0.00               Yes      A          0.05       A         A         A         A
5   247     B                  0.44               Yes      B          0.82       A         A         A         A

The subject matter experts all agreed on four of the five experiments. On the experiment where one rater disagreed with the others, #3, the majority of raters selected the condition which the traditional method chose as best. This experiment was also one in which the two methods disagreed on the best condition, and only the traditional method showed statistical significance. On the experiment in which both methods of analysis showed a statistical difference, #4, there was total agreement between our subject matter experts and both of the methods. On average, the traditional method agreed more with the raters' choices and also found significance in two of the experiments, whereas the Item Effect Model only found significance in one. However, a correlation coefficient of 0.935 was calculated between the two methods' significance values, indicating that the Item Effect Model's significance is highly correlated with that of the hypothesis testing method for these RCT-style datasets.
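As a rough check, this correlation can be approximately reproduced from the rounded significance values in Table 4; a small discrepancy from the reported 0.935 is expected because the table values are rounded.

```python
import numpy as np

# Rounded significance values from Table 4 (traditional t-test vs. Item Effect Model).
traditional = np.array([0.74, 0.36, 0.04, 0.00, 0.44])
item_effect = np.array([0.95, 0.62, 0.38, 0.05, 0.82])

r = np.corrcoef(traditional, item_effect)[0, 1]
print(round(r, 3))   # approximately 0.94 with these rounded values
```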

5 Contributions

In this work we presented an empirical validation of some of the capabilities and benefits of the Item Effect Model introduced by Pardos and Heffernan (2009). We conducted five original experiments comparing established forms of tutorial feedback and used them to compare the analysis of a traditional hypothesis testing method with that of the Item Effect Model. By conducting the experiments through inserting multiple tutorial feedback conditions into items in mastery learning problem sets, we were able to demonstrate an example of running an experiment without interrupting the learner's curriculum and without giving them the at times lengthy pre-tests that prohibit feedback and place students' learning on hold. A second contribution of this work was to highlight how random orderings of questions can be analyzed as if they were RCTs. This is a powerful idea: simply randomizing the order of items creates intrinsic experimental designs that allow us to mine valuable pedagogical insights about what causes learning. We suspect that the methodology we used to calculate significance in the Item Effect Model could be made more powerful, but we decided to err on the side of caution by using a methodology we were sure was not going to lead us to draw spurious conclusions. It was encouraging to see how the Item Effect Model fared when compared to a more traditional hypothesis testing method of analysis. The model agreed with the traditional tests in 9 out of the 11 experiments and detected an effect that the traditional method could not: a typo effect in a condition of one of our mastery learning problem set experiments. We believe the implication of these results is that ITS researchers can safely explore their datasets using the Item Effect Model without concern that they will draw spurious conclusions. They can be assured that if there is a difference in learning effect, there is a reasonable chance it can be inferred, and inferred from almost any source of randomized item data in their system.

Acknowledgements

We would like to thank all of the people associated with creating the ASSISTment system listed at www.ASSISTment.org, including investigators Kenneth Koedinger and Brian Junker at Carnegie Mellon. We would also like to acknowledge funding from the US Department of Education, the National Science Foundation, the Office of Naval Research and the Spencer Foundation. All of the opinions expressed in this paper are solely those of the authors and not those of our funding organizations.

References

1. Pardos, Z. & Heffernan, N. (2009) Detecting the Learning Value of Items in a Randomized Problem Set. In Dimitrova, Mizoguchi, du Boulay & Graesser (Eds.) Proceedings of the 2009 Artificial Intelligence in Education Conference. IOS Press. pp. 499-506.
2. Corbett, A. T. (2001). Cognitive computer tutors: solving the two-sigma problem. In M. Bauer, P. Gmytrasiewicz, & J. Vassileva (Eds.) User Modeling 2001: Proceedings of the Eighth International Conference, UM 2001 (pp. 137–147). New York: Springer.
3. Corbett, A. T., & Anderson, J. R. (1995). Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4, 253–278.
4. Koedinger, K. R., Anderson, J. R., Hadley, W. H., & Mark, M. A. (1997). Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education, 8, 30–43.
5. Razzaq, L. & Heffernan, N. (2009) To Tutor or Not to Tutor: That is the Question. In Dimitrova, Mizoguchi, du Boulay & Graesser (Eds.) Proceedings of the 2009 Artificial Intelligence in Education Conference. IOS Press. pp. 457-464.
6. Razzaq, L. & Heffernan, N. T. (2006). Scaffolding vs. hints in the Assistment system. In Ikeda, Ashley & Chan (Eds.) Proceedings of the Eighth International Conference on Intelligent Tutoring Systems. Springer-Verlag: Berlin. pp. 635-644.
7. Kim, R., Weitz, R., Heffernan, N. & Krach, N. (2009). Tutored Problem Solving vs. “Pure” Worked Examples. In N. A. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. TBA). Austin, TX: Cognitive Science Society.