Tabling Method – A simple and practical complement to Knowledge Tracing Qing yang Wang, Zachary A. Pardos and Neil T. Heffernan {wangqy, zpardos, nth}@wpi.edu Computer Science Department Worcester Polytechnic Institute
Abstract. Bayesian Knowledge Tracing (BKT) is a widely used technique in many kinds of cognitive tutor systems. One natural way to extend KT is to modify its first-order Markov assumption. In this paper, we propose a method named "fold tabling" which utilizes the whole sequence of observable performance variables in the hope of improving prediction accuracy. Training and testing on our mastery learning dataset shows that the new model's prediction accuracy is slightly worse than BKT's, but the covariance analysis shows that our model is significantly different from BKT, and ensembling the two predictions reliably increased prediction accuracy.
1. Introduction
Intelligent Tutoring Systems (ITS) [1] aim to improve student learning using reliable assessment. For decades, various methods have been developed to model student behavior; the most well-known is the Knowledge Tracing model (Corbett and Anderson 1995) [2], which uses a Dynamic Bayesian Network to model student learning, with performance as the observation and knowledge state as a latent variable. The model successfully follows the mastery learning concept and has done very well in various ITS applications. Knowledge Tracing can be viewed as a first-order Markov process: knowledge is conditioned only on the immediately preceding step, so no longer-range dependence can be captured. While this assumption greatly reduces computational complexity, it does not seem so obviously true in real life. Intuitively, not only the previous answer but all preceding answers may have some effect on student learning. In this paper, we therefore explore a new method that uses the whole sequence of performance variables in the hope of better predicting student performance. In the rest of this paper, we first describe the tutoring system and the data we use in Section 2; the analysis, comparison, and combination of our model with BKT are in Section 3; the results on our dataset compared with BKT are shown in Section 4; and finally, the discussion of our conclusions and future work is covered in the last sections.
2. Dataset description
The mastery learning dataset used in this paper comes from ASSISTments [3], a web-based tutoring system for 4th- to 10th-grade mathematics. Mastery learning is a strategy that requires students to continue working on a problem set until they have achieved a criterion. The data was collected in 2009 in a suburban middle school in central Massachusetts. Altogether we picked 14 mastery learning problem sets, and each question has about 800 student responses on average.
3. Model description
BKT
The Knowledge Tracing model has been widely used in ITS to model student knowledge and performance over time. As shown in Fig. 1, Knowledge Tracing is a typical two-variable HMM, with one latent and one observable variable. There are five parameters for each skill: initial knowledge, the student's prior knowledge distribution P(Knowledge_0); the probability of learning the skill and the probability of forgetting the skill (usually treated as 0), which are the parameters of the transition model P(Knowledge_t+1 | Knowledge_t); and the probability of guessing correctly when the student does not know the skill and the probability of slipping when the student does know it, which are the parameters of the sensor model P(Performance_t | Knowledge_t).
Fig 1: Knowledge Tracing Model
Fig 2: BNT_SM
In our experiments, we used the Bayes Net Toolbox for Student Modeling (BNT-SM) to implement Knowledge Tracing, and the Expectation Maximization (EM) algorithm to fit the model to the dataset. EM finds a set of parameters that maximizes the likelihood of the data by iteratively running an expectation step, which calculates the expected likelihood given the student performance data, and a maximization step, which computes the parameters that maximize that expected likelihood. A drawback of EM is that it is a local optimization method, so it can get stuck in local maxima. In our experiments, we set the initial parameters as follows: initial knowledge = 0.68; guess = 0.38; slip = 0.07; learning = 0.14. As the final parameters and results show, the model behaves reasonably with these initial values.
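The forward update behind these parameters can be sketched as follows. This is a minimal illustration of standard Knowledge Tracing inference, not BNT-SM's actual implementation; the function name and interface are ours, with the initial parameter values listed above as defaults.

```python
def bkt_predict(responses, p_know=0.68, learn=0.14, guess=0.38, slip=0.07):
    """Return P(correct) predictions for each step given prior responses.

    responses: list of 1 (incorrect) / 2 (correct), as coded in this paper.
    """
    preds = []
    for r in responses:
        # Predict the next observation from the current knowledge estimate.
        p_correct = p_know * (1 - slip) + (1 - p_know) * guess
        preds.append(p_correct)
        # Condition the knowledge estimate on the observed response (Bayes rule).
        if r == 2:  # correct
            p_know = p_know * (1 - slip) / p_correct
        else:       # incorrect
            p_know = p_know * slip / (1 - p_correct)
        # Apply the learning transition (forgetting is treated as 0).
        p_know = p_know + (1 - p_know) * learn
    return preds
```

Note how each prediction depends on the full history only through the single scalar p_know, which is exactly the first-order Markov property the tabling method tries to relax.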
Fold Tabling
The tabling method comes from the simple intuition of using percent correctness as a representation to predict student performance. To extend percent correctness into a more specific prediction for a particular student, we take several more responses into account. Suppose, for example, a student has previous responses '2121' (2 for a correct answer and 1 for an incorrect answer). We first use the overall percent correctness P(performance | '') as one prediction; then we add the percent correctness given that the previous answer was incorrect, P(performance | '1'); following the same rule, we can decompose his whole response history to obtain three more predictions: P(performance | '21'), P(performance | '121'), and P(performance | '2121'). We then combine all five predictions to compute an overall prediction; we tried several averaging methods, covered later. The above describes how to infer a student's prediction given the complete "percent correctness" table. Next we show how to build this table from the training set. The idea is simple: to make these percent correctness values statistically reasonable, we need a lot of data, so we utilize every subsequence that appears in the training set. For example, suppose the training set contains a student with 5 responses, '12122' (2 for correct and 1 for incorrect). The fold table is built as follows.
Context   Number correct   Number incorrect   Percent correctness
''        3                2                  0.6    P(x=2|'')
'1'       2                0                  1.0    P(x=2|'1')
'2'       1                1                  0.5    P(x=2|'2')
'11'      0                0                  ?      P(x=2|'11')
'12'      1                1                  0.5    P(x=2|'12')
'21'      1                0                  1.0    P(x=2|'21')
'22'      0                0                  ?      P(x=2|'22')

Table 1: Example fold table for a student with training performance '12122'

For inference, weights follow two criteria: (1) a context with more instances is more reliable; (2) a more discriminative context is more valuable. For a student with history '2121', the prediction is:

P(x=2) = P(x=2|'') * w0 + P(x=2|'1') * w1 + P(x=2|'21') * w2 + P(x=2|'121') * w3 + P(x=2|'2121') * w4

Fig 3: Example of inference for a student with performance '2121'

To build the table, we first find all substrings of '12122' of length 1: three '2's and two '1's. We update the table accordingly: P(performance | '') = 3 out of 5 chances to get '2'. This is exactly the usual way of calculating percent correctness. Next we find all substrings of length 2: two '12', one '21', zero '11', and one '22'. Updating the table gives P(performance | '1') = 2 out of 2 chances to be correct (remember this is just one student; over the whole training set the percentage becomes statistically reasonable) and P(performance | '2') = 1 out of 2 chances to be correct. We continue in the same way for all substrings of length 3 and so on. As mentioned in the inference part, we tried several ways to combine all the percent correctness values that are available, following two criteria: favor contexts with more instances, and favor contexts with more discriminative power. The first way is simply to average them, which technically just sets all table weights to 1; this treats all existing percent correctness values as created equal.
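The construction and inference steps above can be sketched in code. This is a minimal illustration of the fold-tabling idea as we describe it, with equal weights by default; all names are ours.

```python
from collections import defaultdict

def build_fold_table(sequences):
    """Count, for every context substring, how often the next response is correct.

    sequences: strings over {'1', '2'} with 2 = correct, 1 = incorrect.
    Returns: context -> [n_incorrect, n_correct].
    """
    counts = defaultdict(lambda: [0, 0])
    for seq in sequences:
        for i, r in enumerate(seq):
            # Every suffix of the prefix before step i (including '') is a context.
            for j in range(i + 1):
                context = seq[j:i]
                counts[context][1 if r == '2' else 0] += 1
    return counts

def predict(counts, history, weight=lambda n: 1.0):
    """Weighted average of P(correct | suffix) over all suffixes of the history."""
    num, den = 0.0, 0.0
    for j in range(len(history) + 1):
        suffix = history[j:]
        wrong, right = counts.get(suffix, [0, 0])
        n = wrong + right
        if n == 0:
            continue  # unseen context, e.g. '11' in Table 1
        w = weight(n)
        num += w * right / n
        den += w
    return num / den if den else None
```

Running build_fold_table(['12122']) reproduces the counts in Table 1, and predict(table, '2121') implements the weighted sum in Fig 3, skipping contexts with no data.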
The second way uses a beta distribution to describe the percent correctness, following the intuition that a context with more instances is more reliable than one with fewer. For example, using the beta distribution, for the case of 2 out of 4, Weight(P(knowledge = 2) = 0.5) = 0.1863, while for 20 out of 40 the weight is 0.4816. (This is reasonable: consider the coin-flipping scenario of predicting whether a coin is fair; more experiments gradually make the conclusion more convincing.) Yet another way is to treat some discriminative percent correctness values as more "equal" than others: predictions far from the threshold (1.5 in our binary setting with values 1 and 2) are more valuable (they carry more information) than predictions near 1.5, so we can set weights accordingly. Two additional weighting schemes are borrowed from decision tree theory: Gini impurity, used by the CART algorithm [5], which measures the probability that a randomly placed item will end up in the wrong category, and entropy [6], the negative sum of p(x) log p(x) over all possible outcomes.
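One reading of the beta-distribution weight that is consistent with the numbers quoted above (0.1863 for 2 out of 4, 0.4816 for 20 out of 40) is the posterior probability that the true correctness rate lies within a small interval around the observed estimate. The sketch below makes that assumption explicit; the interval half-width of 0.05 and all names are ours, not stated in the text. The discriminativeness, Gini, and entropy quantities are also sketched.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, via log-gamma for numerical stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_weight(correct, total, half_width=0.05, steps=10000):
    """Posterior mass of Beta(correct+1, incorrect+1) within +/- half_width
    of the observed rate, computed by the trapezoid rule."""
    a, b = correct + 1, total - correct + 1
    p_hat = correct / total
    lo = max(1e-9, p_hat - half_width)
    hi = min(1 - 1e-9, p_hat + half_width)
    h = (hi - lo) / steps
    mass = 0.5 * (beta_pdf(lo, a, b) + beta_pdf(hi, a, b))
    mass += sum(beta_pdf(lo + i * h, a, b) for i in range(1, steps))
    return mass * h

def discriminative_weight(p):
    """Favor predictions far from the decision point (0.5 on the probability scale)."""
    return abs(p - 0.5)

def gini_impurity(p):
    """CART's Gini impurity for a binary outcome; lower = more discriminative."""
    return 2 * p * (1 - p)

def entropy(p):
    """Binary entropy in bits; lower = more discriminative."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

With these definitions, beta_weight(2, 4) is about 0.186 and beta_weight(20, 40) is about 0.48, matching the intuition that 20 out of 40 pins down a rate of 0.5 far more tightly than 2 out of 4.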
Ensembling the two models
We also considered combining our model with standard KT in the expectation of more accurate results. We mainly considered two variants: a marginal one and a most-probable-evidence (MPE) one; the results are in the next section. Since our new tabling model is considerably different from the KT model (this is also shown in the next section), we hoped that by ensembling them we could reduce bias without increasing variance. The results show that both ensemble variants work very well. For the marginal variant, we first simply take the average; Pardos & Heffernan (2010) [3] demonstrated that a simple averaging technique can lead to higher prediction accuracy than either of the two combined methods. Similarly, we average the tabling model and Knowledge Tracing together: presumably, if the two models have uncorrelated errors, we can get lower error by simply averaging them. In addition to averaging, we can add max/min logic: if PT (the prediction from the tabling method) and PK (the prediction from the KT model) are both above a threshold (say 1.5), i.e. both models predict that a correct answer is more likely than an incorrect one, we take max(PT, PK) as the result; conversely, we take min(PT, PK) if both PT and PK are below the threshold. In the remaining case, when PT and PK disagree, we use the average (or a squared average, which favors the prediction farther from the threshold). In the MPE variant we follow the same rule, but instead of max and min we output 1 and 0. For the sake of fairness, we compared this result with the MPE results of the tabling method and the KT method alone. We also compared this averaging approach with boosting and SVM.
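The average/min/max rule above can be sketched as follows, stated on the probability-of-correct scale, where 0.5 plays the role of the 1.5 threshold on the 1/2 response coding; the function names are ours.

```python
def ensemble(p_table, p_kt, threshold=0.5):
    """Marginal average-min-max ensemble of tabling and KT predictions."""
    if p_table > threshold and p_kt > threshold:
        return max(p_table, p_kt)    # both predict "correct": take the confident one
    if p_table < threshold and p_kt < threshold:
        return min(p_table, p_kt)    # both predict "incorrect": take the confident one
    return (p_table + p_kt) / 2      # disagreement: fall back to the average

def ensemble_mpe(p_table, p_kt, threshold=0.5):
    """MPE variant: snap agreeing predictions to 1 or 0 instead of max/min."""
    if p_table > threshold and p_kt > threshold:
        return 1.0
    if p_table < threshold and p_kt < threshold:
        return 0.0
    return (p_table + p_kt) / 2
```

For example, ensemble(0.7, 0.6) returns 0.7 (agreement on "correct"), while ensemble(0.7, 0.3) returns the average 0.5 because the two models disagree.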
4. Results
The results are shown in three parts: the three versions of the tabling method; the comparison between KT and tabling; and the ensemble model compared with KT and tabling.

1) Comparison among the tabling weighting methods

Method                         RMSE       t-test vs. "All Equal"
All Equal                      0.369201   -
Beta Distribution              0.369192   0.25
Favor More Discriminative One  0.369212   0.16
Entropy                        0.369224   0.20
Gini impurity                  0.369210   0.27

Table 2: Comparison among different weighting methods. (We report t-test values in the same way we did with the KDD Cup data; we recognize that the independence assumptions of the t-test are not fully satisfied.)

We can see that the beta-distribution weighting performs a bit better than the simple average, and favoring the more discriminative one performs a bit worse, but the differences are not significant. Besides the fact that our data might not be fully representative, the sparsity of our table is probably the major drawback, which motivates the improvements discussed in the last section. Unfortunately, the entropy and Gini impurity metrics did not bring much improvement either.

2) Comparison between KT and tabling
Method             RMSE     t-test   t-test on residuals
KT                 0.3616   -        -
Table (All Equal)  0.3692   0.03     0.05

Table 3: Comparing the KT and tabling methods

We can see that tabling is a bit worse than KT, and both the t-test on the prediction values and the t-test on the residuals show that the difference between the two models is significant.

3) Comparison of the ensembles with KT and tabling

Marginal evidence:

Method           RMSE     t-test (vs. KT)   t-test on residuals (vs. KT)
KT               0.3616   -                 -
Table            0.3692   0.0293            0.0507
Average          0.3621   0.029             0.421
Average_Min_Max  0.3546   0.000             0.001

Table 5: Comparing KT and the ensembled tabling methods (without threshold)

Most probable evidence:

Method    RMSE     t-test (vs. KT)   t-test on residuals (vs. KT)
KT        0.3356   -                 -
Table     0.3275   0.000             0.328
Average   0.3105   0.000             0.117

Table 6: Comparing KT and the ensembled tabling methods (with threshold)

We find it very interesting that just by using the threshold, the tabling method seems to outperform KT; and both results show that ensembling KT and tabling performs better than either raw method alone.
5. Discussion and Future Work
The two weighting schemes treat contexts that lack instances differently, so the fact that they did not show a significant difference implies that longer sequences seem less important than short ones. One thing left to explore is therefore to find a threshold that keeps only a portion of the table, avoiding over-generalization and reducing parameter complexity.
6. Contribution
In this paper, we introduced the fold-tabling approach to model and predict student performance. So far we have tried using only percent correctness as the feature for this method, and it turns out to perform quite well compared with KT. Although the tabling framework seems naive, it is actually quite flexible and powerful in incorporating more features to make predictions. We also tried several methods to combine our model with Knowledge Tracing; as the results show, although some properties of our dataset are not fully understood (e.g., the fact that RMSE drops significantly when using the threshold), the combination of our model with KT achieved the best performance.
References [1] Koedinger, K. R., Anderson, J. R., Hadley, W. H., & Mark, M. A. (1997). Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education, 8, 30–43.
[2] Corbett, A. T., & Anderson, J. R. (1995). Knowledge tracing: modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4, 253–278.
[3] Pardos, Z. A., & Heffernan, N. T. (2010). Modeling individualization in a Bayesian networks implementation of knowledge tracing. In Proceedings of the 18th International Conference on User Modeling, Adaptation, and Personalization.
[4] Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (2004). Ensemble selection from libraries of models. In Proceedings of the 21st International Conference on Machine Learning.
[5] Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, California.
[6] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.
[7] Wang, Y., & Heffernan, N. T. The "Assistance" Model: Leveraging How Many Hints and Attempts a Student Needs.