Using a Learning Agent with a Student Model

Joseph E. Beck and Beverly Park Woolf
Center for Knowledge Communication
Department of Computer Science
University of Massachusetts
Amherst, MA 01003 U.S.A.
beck, [email protected]

Abstract. In this paper we describe the application of machine learning to the problem of constructing a student model for an intelligent tutoring system. The proposed system learns, on a per-student basis, how long an individual student will require to solve a problem presented by the tutor. This model of relative problem difficulty is learned within a "two-phase" learning algorithm. First, data from the entire student population are used to train a neural network. Second, the system learns how to modify the neural network's output to better fit each individual student's performance. Both components of the model proved useful in improving its accuracy. This model of time to solve a problem is used by the tutor to control the complexity of problems presented to the student.

1 Introduction

MFD (Mixed numbers, Fractions, and Decimals) is an intelligent tutoring system (ITS) that teaches arithmetic to fifth and sixth graders. This system adapts its instruction to meet the needs of each learner by intelligently selecting a topic on which the student should work, providing hints that match the student's level of ability, and dynamically constructing problems that are appropriate for the student [2].

The system represents the domain as a topic network. A topic refers to a large unit of knowledge, such as "subtract fractions" or "multiply whole numbers." Each topic has a list of pretopics; before the student can work on a topic, all of its pretopics must be passed. In addition, the system knows a topic's subskills, which are all of the individual steps that must be performed to solve a problem of a given topic. For example, in the problem 1/3 + 1/2 the topic is "add fractions." This topic has the subskills of finding the least common multiple, converting the operands to equivalent denominators, simplifying the result, and making the fraction proper. However, it is not necessarily true that all subskills will have to be applied to solve a particular problem. In this case, the student will not have to simplify her answer or make the result proper.

In this paper, we describe an analysis of an evaluation of MFD that took place in Winter 1996 and Spring 1997. The machine learning techniques described here were added post hoc, and have been applied to the data gathered during the evaluation. We are working to incorporate these mechanisms into the next version of the tutor.

Currently, the MFD tutor adjusts problem difficulty via a complex, but ad hoc, mechanism [2]. The system rates a problem's complexity on an absolute scale and builds problems that are at a "good" level of difficulty for each student and that require the student to apply skills in which she needs more practice. Rather than controlling the problem's complexity directly, the new system will construct problems that require a certain amount of time for the student to solve. This is important, since if a problem takes too long to solve, the learner may become frustrated and lose confidence in her math abilities. MFD determines whether the student is not making sufficient progress on the current problem, and if so generates an easier problem which she can handle. For example, a student who is very confident in her abilities or very proficient could be given problems that take a considerable amount of time to solve. A student who has poor math skills, or who has low self-confidence in her math skills, may become frustrated or give up if a problem requires too much time to solve. This student should be given gradually more difficult problems. Since one of the goals of the MFD system is to increase girls' confidence in mathematics [3], the tutor periodically presents the student with questionnaires [5] to rate her self-confidence. So it is certainly possible to add a measure of self-confidence to the system's student model.

Predicting the amount of time students require to solve a problem is difficult. First, the overall skill level of the student must be accounted for. If two students are given identical problems, the more skilled student will solve the problem more quickly than the less skilled student. Second, problems vary in difficulty; a student can solve 1/3 + 1/3 much more quickly than she could solve 7/10 + 11/12. Third, there are considerable individual differences in how quickly students solve problems, and these differences are not a component of most student models. For example, some students navigate the keyboard more quickly than others, some students have the multiplication tables memorized while others must use pencil and paper, etc. Any of these factors can impact the time required to solve a problem. Finally, the time required to solve a problem is a noisy variable. That is, if a student is given the same problem twice, she may take very different amounts of time to solve it. A small mistake at the beginning of the problem-solving process can drastically impact the time required to solve a problem.

For these reasons, we are using a machine learning (ML) agent to predict the amount of time students require to solve a problem. We are using function approximators because they are robust with respect to noisy data [6]. Additionally, with a learning agent, it is unnecessary to specify a priori how the prediction task should be accomplished; the learning agent is provided with the data and allowed to construct its own theories.

2 Related work

ML techniques have been applied to the construction of student models. Chiu and Webb [4] used C4.5, a decision tree algorithm, to construct a set of rules for determining whether a student had a particular misconception. Results of this architecture were mixed: the learning agent was capable of successfully classifying student misconceptions but had difficulties deciding between multiple possible "bugs" in the student's knowledge. ML has also been used to select which high-level teaching action to perform next. Quafafou et al. [7] built a prototype system, NeTutor. This system constructed tables of rules that would learn which types of interactions (guided, discovery learning, examples, etc.) enabled each student to learn best.

Prior work applying ML techniques to ITSs has concentrated on learning features that are already a part of student models. We use the flexibility of machine learning to include other data that are not traditionally used, such as problem difficulty and student beliefs about their abilities. ML allows us to use this knowledge without knowing ahead of time how it is to be used.

3 Methodology

The MFD tutor has been used in three fifth and sixth grade classrooms in suburban and urban settings. Students enjoyed working with the system, and girls' self-confidence in their math ability increased as a result of using it [3].

3.1 Data collected

While the grade school students were using the tutor, the system gathered a variety of data: the tutor's belief of the student's ability at the topic and her average score on the required subskills (range of [1...7]), the student's estimates of her ability at the topic and its required subskills (integers in the range [1...7]), data about the problem type (addition or subtraction), the problem's estimated level of difficulty (derived from the domain model, range of [0...7]), and the student's gender. The system only considers the student's ability on subskills that are related to the current problem. For example, in 1/2 - 1/3, students would have to find the least common multiple of 2 and 3, and then convert 1/3 and 1/2 into fractions with equivalent denominators. However, students would not have to convert the answer to proper form or simplify the answer, so these subskills would be unused.

The learning algorithm is given the student's self-assessments of her abilities. The system periodically asked the student to evaluate her proficiency in various domain topics. Averaged across all students, there was no linear relationship between their self-assessment and their actual performance. However, on a per-student basis, students misestimated their abilities in a fairly systematic way. For more information see [1].

After the student solved the problem, the time required was stored. Fifty students used the tutor, and solved 1781 problems involving fractions. We used these data to train a learning agent that learns a "population student model" of how long it takes students to solve a problem. So our task is: given a student, her student model, and the problem to be solved, predict how long it will take her to solve that problem.

3.2 Learning architecture

There are two ML components in our system. The population student model (PSM) is based on data from the entire population of users. The second component operates on a per-student basis, and adapts the PSM to better predict each student's performance. The population student model takes as input characteristics of the student, as well as information about the problem to be solved, and outputs the expected time (in seconds) the student will need to solve that problem. This information is gathered for every student using the system.
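As a concrete illustration of the inputs and target just described, the sketch below defines a hypothetical Python record for one logged problem-solving attempt. The field names and value ranges are our own labels for illustration; the paper does not specify its log format.

```python
from dataclasses import dataclass

@dataclass
class ProblemRecord:
    """One logged problem-solving attempt (hypothetical field names)."""
    tutor_topic_ability: float      # tutor's belief of ability on the topic, 1..7
    tutor_subskill_ability: float   # average tutor rating on required subskills, 1..7
    self_topic_estimate: int        # student's self-rating of the topic, 1..7
    self_subskill_estimate: int     # student's self-rating of required subskills, 1..7
    problem_type: str               # "addition" or "subtraction"
    problem_difficulty: float       # estimated difficulty from the domain model, 0..7
    gender: str                     # "F" or "M"
    seconds_to_solve: float         # observed time; the quantity to be predicted

# Example record, loosely following the data described in Section 3.1.
example = ProblemRecord(5.5, 4.0, 6, 5, "addition", 3.0, "F", 85.0)
```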

It is important to note that this population model is also a student model, since its predictions are dependent upon the student model; therefore, the predictions will differ for each student. The rationale for training a learning agent with many different students is that it provides a larger set of datapoints. A tutor will interact with each student to only a limited extent. However, by combining all of the tutor's interactions with every student, a significant amount of training data can be gathered.

Even though this population model considers the student's abilities in making its prediction, its accuracy can still be improved by refining its predictions to better match those of the individual students. For example, in our task, some students could have the same level of ability (as reported by the tutor), yet one may navigate the keyboard more quickly, or one may think faster than another. These differences are not accounted for by our student model, so the learning agent cannot distinguish between these two students. When trying to predict a student's performance on any interesting task, some information will be missing (otherwise we would have a perfect cognitive model). This missing information ensures there is some room to improve our model. We therefore add a second stage of learning to the system that learns how to fine-tune the PSM's predictions to better match the student's actual performance.
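The two-stage design can be summarized as a small prediction pipeline: a population model trained offline, plus a per-student correction applied to its output. The sketch below is a minimal Python outline of that structure under our own naming; the concrete models used in the paper are described in Sections 4 and 5.

```python
class TwoPhasePredictor:
    """Population model plus a per-student correction (illustrative skeleton)."""

    def __init__(self, population_model):
        self.population_model = population_model   # trained offline on all students
        self.adjustment = {}                       # per-student correction functions

    def predict_seconds(self, student_id, features):
        base = self.population_model.predict([features])[0]
        # Apply the student's correction if one has been learned,
        # otherwise use the population prediction unchanged.
        correct = self.adjustment.get(student_id, lambda t: t)
        return correct(base)

    def update_adjustment(self, student_id, correction_fn):
        """Install a correction learned online from this student's history."""
        self.adjustment[student_id] = correction_fn
```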

4 Population student model We use a neural network with the backpropagation learning algorithm1 to learn the PSM. The choice of which machine learning architecture to use depends heavily on the amount of data available. As the training set grows larger, neural networks become better options as learning algorithms due to their ability to learn a variety of functions. Our network uses 25 input units, 4 hidden units, and 1 output unit. These numbers were discovered via trial and error, as network design is still poorly understood. 4.1 Input units The neural network uses the data listed in Section 3.1 as inputs. Input encoding Neural network inputs typically range from 0 to 1 (inclusive). However, many of our network’s inputs fall outside of this range. One option is to normalize the inputs so they fall between 0 and 1. For example, if a value x ranges from 1 to 7, the 1 network would be given x? 6 as input. The other option is to discretize the input. For example, if x ranges from 1 to 7, a set of 7 input units could be created. If x was 1, the first input would be set to 1, and the other six would be 0. If x was 5, the fifth input unit would be set to 1, while the others would be 0. This requires many input units if x can be large. Another scheme is to consider ranges of x. For example, if x is 1 or 2, then input unit 1 is activated, if x is 3, 4, or 5, then input unit 2 is activated, if x is 6 or 7 then input unit 3 is activated. This is a “one-hot” encoding scheme, since for each feature, only one input unit can be active at a time. We discretized the inputs into ranges since that procedure frequently 1

¹ We thank Jeff Shufelt for making his neural network package publicly available.

Intelligent features. An important characteristic of our choice of inputs is that the system does not learn things about any specific component of the domain model, such as "adding fractions." The inputs are phrased in terms of the student's proficiency in the topic being tested. For example, consider a student given the problem 1/4 + 1/4, and whose proficiencies at adding fractions and its subskills are given in Table 1.

Table 1. Sample topic and subskill proficiencies

Skill                            Proficiency Rating
add fractions                    0.6
find the least common multiple   0.3
make equivalent fraction         0.4
simplify                         0.1
make fractions proper            0.3

The neural network's inputs are phrased in terms of the "problem's topic" and the "subskills required." To predict the time required to solve this problem, one input to the network is the proficiency on this topic (which is 0.6). Another input is the average proficiency on the required subskills. Since in our example the only subskill the student must perform is simplification, that is the only subskill considered, and the value of that input will be 0.1.

Constructing features for the neural network in this manner is very powerful. The learning agent does not have to consider additional details. In this example, the student's ability at subtracting whole numbers is unlikely to be useful, so this knowledge is not provided to the learning agent, as it would slow down the learning process. An additional benefit is that the ML agent can better generalize what it does learn. In this case, the network will probably learn that if the student has a low proficiency on the subskills necessary to solve the problem, she will require a long time to solve it. This theory can be applied to a variety of topics. So if the system were to encounter a student attempting to solve a subtract-mixed-numbers problem, and the student had low proficiencies on the needed subskills, the ML agent could conclude that the student will require significant time to solve the problem. Restricting what the learning agent can consider is a form of bias, and can potentially enhance the agent's learning speed and generalization ability. However, there is the potential drawback that this bias causes the agent to ignore some information that is actually useful. Unfortunately, this tradeoff is all too frequent in machine learning.
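The sketch below works through this feature construction for the 1/4 + 1/4 example, using the Table 1 proficiencies; the function and dictionary names are our own, chosen only to mirror the description above.

```python
# Proficiencies from Table 1 (topic plus its subskills).
proficiency = {
    "add fractions": 0.6,
    "find the least common multiple": 0.3,
    "make equivalent fraction": 0.4,
    "simplify": 0.1,
    "make fractions proper": 0.3,
}

def topic_and_subskill_inputs(topic, required_subskills, ratings):
    """Return the topic proficiency and the mean proficiency of the subskills
    actually required by this particular problem."""
    topic_input = ratings[topic]
    subskill_input = sum(ratings[s] for s in required_subskills) / len(required_subskills)
    return topic_input, subskill_input

# For 1/4 + 1/4, only simplification is required (2/4 must be reduced to 1/2).
print(topic_and_subskill_inputs("add fractions", ["simplify"], proficiency))
# -> (0.6, 0.1)
```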

4.2 Target value

The neural network is learning to predict how long students will require to solve a problem. Since time to solve a problem is recorded in seconds, it must be normalized to fall in the range [0...1]. We have decided to discard trials taking longer than 5 minutes, so 300 seconds is the normalizing constant. Thus the target value is time/300.0. This resulted in discarding approximately 5% of the dataset. Discarding these outliers is beneficial, as some of these datapoints do not reflect the actual time students spent solving the problem. For instance, some students may have left their computer for a few minutes, or saved the problem and returned to use the tutor the next day, etc.

4.3 Network constants

There are several learning parameters that must be determined to train a neural network. First, there is the learning rate, which controls how rapidly the network's weights are adjusted. A high learning rate results in rapid initial learning, but the network's performance will soon plateau or possibly become worse. Small values result in slower initial learning, but the system can potentially attain superior performance. Second is the network's momentum. With a high momentum, once a network starts to adjust its internal weights in a certain direction, it has a tendency to continue doing so. This speeds its learning, as the network increases its internal weights more rapidly, and is less likely to backtrack and decrease weights that have been increased. However, a high momentum may cause a network to be too "hasty" in its learning and spend considerable time undoing poor early actions.

4.4 Accuracy of the PSM

To test the accuracy of the PSM, a k-fold cross-validation technique was used. The trimmed dataset consisted of 1587 datapoints, obtained from the logfiles of 50 students. To test the neural network's accuracy for each student, the network was trained on 49 students' data and then tested with the remaining student's data. This procedure was done for each of the students (reinitializing the network between each run). The model's error was calculated on a per-student basis by squaring the difference between the neural network's predicted time and the actual time the student required. Table 2 contains sample data generated for testing purposes. These data demonstrate a student who solves problems more quickly than the model anticipates.

The error rate for the PSM was compared to another function that simply guessed the average time required to solve a problem (approximately 98 seconds). If our ML agent cannot outperform another agent that simply guesses the average time, there is little point in using it. Indeed, this evaluation procedure helped uncover several bugs in our implementation. Setting the learning rate to 0.001, momentum to 0.0, and training the network 2,000 times produced the best model fit, and had 33% less squared error than simply guessing the average (i.e., the neural network accounted for 33% of the variance). For our initial work, we were concerned with speed of learning, so we set both the learning rate and momentum to 0.4. Running the network for 10 iterations produced a fairly good model fit, on the order of 25% of the variance accounted for.
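The evaluation can be reproduced in outline as follows: drop trials over 300 seconds, scale the target by 300, and train on 49 students while testing on the held-out one, comparing squared error against a mean-time baseline. This sketch uses scikit-learn's MLPRegressor as a stand-in for the original backpropagation package, and assumes X, y_seconds, and student_ids arrays built from the logged records; the hyperparameters simply echo the values quoted above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def evaluate_psm(X, y_seconds, student_ids):
    """Leave-one-student-out evaluation of a small backprop network.

    X: (n, 25) encoded inputs; y_seconds: observed solution times;
    student_ids: which student produced each row."""
    keep = y_seconds <= 300.0                     # discard trials over 5 minutes
    X, y, ids = X[keep], y_seconds[keep] / 300.0, student_ids[keep]

    nn_error, baseline_error = 0.0, 0.0
    for student in np.unique(ids):
        train, test = ids != student, ids == student
        net = MLPRegressor(hidden_layer_sizes=(4,), solver="sgd",
                           learning_rate_init=0.001, momentum=0.0,
                           max_iter=2000)
        net.fit(X[train], y[train])
        pred = net.predict(X[test])
        nn_error += np.sum((pred - y[test]) ** 2)
        baseline_error += np.sum((y[train].mean() - y[test]) ** 2)
    return nn_error, baseline_error
```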

Fig. 1. Error for each model type.

The first two entries in Figure 1 show the sum-squared error produced by each model. In theory, it is possible to improve the network's accuracy by altering the number of hidden units or allowing it to train longer. We experimented with having more hidden units, but performance did not improve significantly. It is unlikely that training the network for longer will produce appreciable gains, as most of our simulation runs plateaued at roughly the same level of performance.

5 Adapting the PSM to Individuals

The data indicate our model does not perfectly predict student behavior, and there is substantial room to improve its performance. Changing the network architecture or training for more iterations has been ruled out. One option is to augment the network's inputs. As mentioned previously, an ITS's student model does not contain much of the information that determines how quickly students solve problems. It is certainly possible to measure such things as typing ability, speed at multiplying simple numbers, etc., and to derive a speed estimate. However, it may be more efficient for the system to learn how to tweak the outputs of the neural network to better account for each student's performance. Since this adjusting is done on a per-student basis, relatively few datapoints will be available. Therefore, this "adjustment function" should be fairly simple to learn. The PSM is learned off-line from an entire population of data. However, the adjustment function can be learned on-line while students are using the ITS.

We feel it is necessary to adjust the PSM to individuals to minimize frustration. If a student spends a large amount of time on a problem, it is likely she will become frustrated and will not consider that her slowness is due to problems using the keyboard or difficulty remembering the multiplication tables. In the future, MFD will be more interactive, and this model will determine how much scaffolding the student will be given.

5.1 Adjustment functions

We selected two simple operations to modify the neural network's predictions: adding a constant factor to its output, and multiplying its output by a constant factor. Either of these operations can account for students who solve problems more quickly or more slowly than their student models would otherwise indicate. These adjustments were not applied simultaneously; we tested each independently.

Table 2. Error in the neural network's predictions of a sample student's performance

                   Times                                            Total
Actual times        10   35   30   50   10   40   25   35   20   40   295
Predicted times     20   50   40   30   25   55   10   70   15   70   385
Error              100  225  100  400  225  225  225  625   25  900  3050

The formula for the additive constant is (total actual time - total predicted time) / number of cases, and for the multiplicative constant it is total actual time / total predicted time.

For the data in Table 2, the additive factor used for correction is (295 - 385)/10 = -9. So each of the neural network's estimates would be lowered by 9 seconds for this student. Table 3 shows the result of this procedure and its effect upon the error. To correct for the overestimation, the multiplicative factor is 295/385 = 0.77. So the neural network's outputs would be multiplied by 0.77. Table 4 shows the result of this process. The error has been reduced from 3050 to 1786.
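A minimal sketch of how these two corrections could be computed from a student's observed history is given below, using the Table 2 times; because the paper rounds its adjusted predictions to whole seconds, the exact error totals it reports may differ from what this code produces.

```python
actual    = [10, 35, 30, 50, 10, 40, 25, 35, 20, 40]   # Table 2, actual times
predicted = [20, 50, 40, 30, 25, 55, 10, 70, 15, 70]   # Table 2, PSM predictions

additive = (sum(actual) - sum(predicted)) / len(actual)   # (295 - 385)/10 = -9.0
multiplicative = sum(actual) / sum(predicted)             # 295/385 ~= 0.77

def squared_error(adjusted):
    """Sum of squared differences between adjusted predictions and actual times."""
    return sum((a - p) ** 2 for a, p in zip(actual, adjusted))

err_raw  = squared_error(predicted)
err_add  = squared_error([p + additive for p in predicted])
err_mult = squared_error([p * multiplicative for p in predicted])
print(additive, round(multiplicative, 2), err_raw, err_add, err_mult)
```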

Table 3. Sample error when adjusted by an additive constant

                            Times                                            Total
Predicted                    20   50   40   30   25   55   10   70   15   70   385
With additive constant       11   41   31   21   16   46    1   61    6   61   295
Actual                       10   35   30   50   10   40   25   35   20   40   295
Error (additive)              1   36    1  841   36   36  196  676  196  441  2460

This example demonstrates that substantial reductions in the error rate are possible. The additive factor reduced error from 3050 to 2460, a reduction of 20%. The multiplicative factor reduced error from 3050 to 1786, a reduction of 41%. These factors are not learned "on the fly," since they are calculated based on all of the observed instances, but they serve to illustrate that fairly simple adjustments to the PSM can have a profound impact on its accuracy. It would be straightforward to alter these functions to work on-line.

5.2 Analysis of individualized models

Figure 1 shows that the multiplicative and additive factors both improve model fit. The leftmost element in the graph shows the error that remains when using a model that always predicts students will take the average amount of time to solve a problem (98 seconds).

Table 4. Sample error when adjusted by a multiplicative factor

                                  Times                                            Total
Predicted                          20   50   40   30   25   55   10   70   15   70   385
With multiplicative constant       15   39   31   23   19   42    8   52   12   54   297
Actual                             10   35   30   50   10   40   25   35   20   40   295
Error (multiplicative)             25   16    1  729   81    4  289  381   64  196  1786

The next element is the error from the PSM, which is the neural network described in Section 4. The final two elements in the graph are the results of adjusting the neural network's output using the simple adjustment functions described in Section 5.1. A perfect model would have 0 error. Each additional component of the model reduces error, which provides support for our hypothesis of multiple layers of machine learning. The multiplicative adjustment outperformed adding a number to the neural network's output by a slight amount, but overall both functions are comparable. Roughly 30% of the variance left unexplained by the PSM was accounted for by the adjustment functions. However, even the best combined models account for only about 50% of the variance, and thus have significant room for improvement.

We have also constructed a PSM using linear regression. This accounted for approximately 22% of the variance on this dataset [1]. The current results are superior in three ways: (1) the linear method was not tested with cross-validation, but was trained and tested using all datapoints, which overestimates its accuracy; (2) the NN is more flexible in what it can represent, and with additional data is likely to outperform regression by a larger margin; (3) the NN accounts for more variance than the regression model, especially if it is trained for a long time.

6 Conclusions and future work

Our two-phase learning architecture accounts for roughly 50% of the variance in the amount of time students require to solve a problem. The neural network PSM outperforms simply guessing the average and regression-based techniques, and the neural network with an adjustment function outperformed the basic PSM. Therefore both components of the ML architecture are useful. The PSM is powerful because data can be collected from every user of the system, and broad generalizations about student performance can be drawn. The adjustment functions are beneficial since they capture information about the students that is not stored in the student model. We have shown that our simple adjustment functions, learned on a per-student basis, are capable of improving the performance of the network.

However, there is considerable room to improve our learning architecture's performance. A next step in our research is to explore adding data to the student model that may explain why some students solve problems more quickly or slowly than others. Additionally, perhaps our adjustment functions are too simple.

A linear regression model uses both addition and multiplication, and may be more accurate than our functions. Prior work with regression models showed they are not powerful enough to learn information from a population of students, but they were a good choice for learning about individuals [1].

Our plan is to use this model of time to solve a problem in the next version of the MFD tutor. Details such as a "good" length of time for a student to spend on a problem, and how this depends on her self-confidence and abilities, are still unknown. Perhaps another learning agent could be added to the tutor that would observe the students and determine the optimal problem length. Eventually, however, system designers must make some decisions themselves.

Although applied to the domain of simple arithmetic, this learning architecture should be applicable to a variety of skill-based domains. Most procedural skills are decomposable into components, and if the tutor can rate the student on these component skills, it is straightforward to use our learning architecture. An open question is how well a PSM transfers to other populations, i.e., would a model built from students in an urban school work in a rural school system? We will deploy future versions of MFD to a larger variety of school systems, both to test this hypothesis and to gather more data to permit us to construct a stronger PSM.

Acknowledgements

We acknowledge the contributions of Dave Hart and Carole Beal in designing and conducting the evaluation studies, and the assistance of Mia Stern in the design and construction of the MFD system. This work is supported by the National Science Foundation under contract HRD-95555737. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

1. Beck, J., Stern, M., Woolf, B.P.: Cooperative Student Models. In: Proceedings of the Eighth World Conference on Artificial Intelligence in Education (1997) 127–134
2. Beck, J., Stern, M., Woolf, B.P.: Using the Student Model to Control Problem Difficulty. In: Proceedings of the Seventh International Conference on User Modeling (1997) 277–288
3. Beck, J., Woolf, B.P., Beale, C.: Improving a student's self confidence. Submitted to the Fifteenth National Conference on Artificial Intelligence (1998)
4. Chiu, B., Webb, G.: Using C4.5 as an Induction Engine for Agent Modelling: An Experiment of Optimisation. Machine Learning for User Modeling Workshop at the Seventh International Conference on User Modeling (1997)
5. Eccles, J.S., Wigfield, A., Harold, R.D., Blumenfeld, P.: Age and gender differences in children's self and task perceptions during elementary school. Child Development 64 (1995) 830–847
6. Mitchell, T.: Machine Learning. McGraw-Hill (1997)
7. Quafafou, M., Mekaouche, A., Nwana, H.S.: Multiviews Learning and Intelligent Tutoring Systems. In: Proceedings of the Seventh World Conference on Artificial Intelligence in Education (1995)
