Using Fine-Grained Interaction Data to Improve Models of Problem Solving*

Eric Andrès
DFKI, Saarbrücken

Sergey Sosnovsky
DFKI, Saarbrücken

Lenka Schnaubert
Technische Universität Dresden

ABSTRACT

Many of today's Intelligent Learning Environments record learner interactions. Educational Data Mining and Learning Analytics provide methods to interpret these data and eventually build models of learner behavior. The level of detail at which interactions are tracked by the learning environment influences these models. In this paper, we contrast models predicting learner behavior in fraction-addition tasks built on two data sets of different detail levels. One set contains only data recorded by a typical learning environment; the other is enhanced with preliminary data collected using a new interface for entering learner solutions step by step. Our results show that even the raw (uninterpreted) finer-grained interaction data collected by this interface can yield a gain in predictive power. We plan to continue research on this interface by enhancing it with semantic information produced by external tools, such as domain reasoners.

Categories and Subject Descriptors I.2.6 [Learning]: Parameter learning; K.3.1 [Computer Uses in Education]: Computer-assisted instruction.

Keywords Educational Data Mining, Machine Learning, Statistical Modeling

1. INTRODUCTION

Susanne Narciss
Technische Universität Dresden

*This work was supported by the DFG (Deutsche Forschungsgemeinschaft) under the ATuF project ME 1136/8-1 and NA 738/10-1.

The activity logs produced by learners solving interactive exercises within an intelligent learning environment (ILE) provide a rich source of information that can be used to interpret a learner's actions, model the cognitive state of the learner, and predict his response to a planned instructional intervention. The granularity of such information is an important factor, as it defines the precision of learner modeling and, ultimately, the accuracy of the adaptation the system can provide. However, recording every learner click (as opposed to one log event per submitted answer) does not necessarily result in collecting useful and interpretable information. Maintaining the balance between the meaningfulness of log events and the precision of tracing learner activity is one of the challenges of ILE design. The field of learning analytics can be especially helpful here, as it allows a retrospective look at the system logs to mine usage parameters that can be potentially informative for building an effective learner model for the next system configuration.

In this paper, we present a new user interface for multi-step problem solving and investigate how it can be used to enable new possible sources of information about learner behavior. We contrast models built with and without this additional information. We call the interface Structured Templates for Exercise Progressive Solutions (STEPS) [3]. It allows learners to enter their solutions to multi-step interactive exercises step by step, at the granularity they prefer. Interaction granularity plays an important role in the interpretation of learner actions, as a recent study by VanLehn [12] has shown.

The goal of the analysis described in this paper is to derive predictive models for important outcomes occurring when students work on exercises, among others:

• performance: has the learner solved an exercise correctly? It is interesting to try to predict this using structural information only, as it would be a cheap way to bypass domain modeling.

• confidence: is the learner confident that the submitted solution is correct? In our previous work, we found that self-provided confidence judgements may be helpful in predicting important learning behavior, such as disengaging from an exercise [11]. However, requesting a confidence judgement for each exercise is too intrusive from the experimental point of view and instructionally ineffective, as it requires learners to constantly switch between off-task and on-task cognitive activities. The capability to reliably predict response confidence would benefit the learner modeling components of a system without adding an extra cognitive burden on the learner.

• certain combinations of performance and confidence: a case of particular interest is the situation when a learner provides an incorrect solution while having high confidence in his answer being correct. We call this dangerous knowledge [7].

Figure 1: Screenshot of the STEPS interface

We expected that the inclusion of finer-grained data generated by STEPS would increase the predictive power of models inferring such learner parameters.

2. STRUCTURED TEMPLATES FOR EXERCISE PROGRESSIVE SOLUTIONS

A common problem of current ILEs is that their interfaces for entering learner solutions use a fixed interaction granularity. This is especially limiting for exercises that require a learner to compute several steps before arriving at the final solution. The two standard approaches are: 1) expecting a learner to complete all the intermediate steps outside the system and submit only the final response (e.g., quizzes in learning management systems), or 2) providing him with one or more solution paths of pre-defined intermediate steps (e.g., Math-Tutor exercises¹).

Structured Templates for Exercise Progressive Solutions (STEPS) is a new approach built into the Web-based intelligent learning environment ActiveMath [9] that lets learners input the solution at their own pace by building individual solution paths at the granularity they like. Figure 1 shows a screenshot of a STEPS exercise in progress. A learner can add new intermediate lines using the "+ new step" button. The drop-down list is used to select the desired operation, which brings up a corresponding template. Undesired lines can be removed using the "- delete step" button. Existing lines can be modified.

This approach makes it possible to decouple the learner's intent (selection of the operation) from its actual execution (filling the template blanks). A reliable interpretation of learning events in this interface would allow for very focused learner support, e.g., a feedback message targeting not only the individual step, but also the initial choice of the operation within the step, or a particular field within the step template. This is particularly important for novice learners, because they tend to favor receiving step-based feedback [1]. At the same time, the separation of operation declaration and execution helps the system to interpret the learner's intent, as he specifies it by selecting the operation for the step.
We instrumented the interface to log all possible learner interactions with the system, including: addition/removal of lines, input to or modification of blanks, and selection of an operator in the drop-down list. An example of such a user input event is shown in Fig. 2. The complete state of the STEPS interface is captured in the attribute inputData.

¹ http://mathtutor.web.cmu.edu/

The attribute status="final" indicates that the learner submitted this solution state to ActiveMath.

Figure 2: A sample STEPS user input event

From previous work, we know that response confidence is an important factor to consider when analyzing learner behavior [10]. In this experiment, we combined the STEPS interface with a small widget that required the learners to provide a mandatory confidence judgement using a radio-button scale, as shown at the bottom of Fig. 3.
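The raw attributes later used as STEPS predictors can be derived from such an event stream. The following is a minimal sketch; the event type names ("addLine", "deleteLine", "input") and the dictionary layout are hypothetical, as the paper does not specify the exact log schema.

```python
def steps_features(events):
    """Derive the three raw STEPS attributes from a list of logged events."""
    num_lines = 0    # lines present in the current solution
    num_inputs = 0   # blanks filled for the first time
    num_edits = 0    # changes to blanks that were already filled
    filled = set()   # (line, blank) pairs that already contain input
    for ev in events:
        if ev["type"] == "addLine":
            num_lines += 1
        elif ev["type"] == "deleteLine":
            num_lines -= 1
        elif ev["type"] == "input":
            key = (ev["line"], ev["blank"])
            if key in filled:
                num_edits += 1
            else:
                filled.add(key)
                num_inputs += 1
    return {"numLines": num_lines, "numInputs": num_inputs, "numEdits": num_edits}

# A made-up interaction sequence: two lines, three fresh inputs, one correction.
log = [
    {"type": "addLine"},
    {"type": "input", "line": 0, "blank": 0},
    {"type": "input", "line": 0, "blank": 1},
    {"type": "input", "line": 0, "blank": 1},  # change to an already-filled blank
    {"type": "addLine"},
    {"type": "input", "line": 1, "blank": 0},
]
print(steps_features(log))  # {'numLines': 2, 'numInputs': 3, 'numEdits': 1}
```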

3. METHOD

The dataset we used was collected during a study on fraction addition run at the Technische Universität Dresden in spring 2012. The study comprised five phases: pre-questionnaire, pre-test, learning, post-test, and post-questionnaire. 6th- and 7th-grade pupils from Saxon schools were recruited through various advertisements; they were rewarded 10 Euro for their participation. The pupils were invited to the psychology of learning and instruction lab. The data was collected in a series of sessions with about ten pupils simultaneously working on ActiveMath, supervised by two researchers. Overall, log files from 124 participants were obtained. For this paper, we use a subset of the original data gathered from 3 post-test exercises requiring learners to add fractions with unlike denominators using STEPS. After cleaning this dataset (some learners had technical problems), we were left with 351 exercise attempts performed by 121 learners.

We also collected motivational information about the pupils using the pre-questionnaire, in particular their intrinsic motivation and their perceived competence, represented as scores ranging from 0 to 100. Furthermore, we used the data from the pre-test to compute learners' initial mastery of adding fractions with unlike denominators. The pre-test contained three exercises on this topic; the initial mastery was estimated as the average score for these exercises. Table 1 lists all the attributes used in this analysis.

We computed predictive models for learner performance, learner confidence, and dangerous knowledge. These target dependent variables are modeled as binary outcomes. For learner performance, we observed whether the exercise was solved correctly or not. Learner confidence was collected as a score on a 9-point discrete scale ranging from confidence in the solution being wrong (-4) over uncertainty (0) to confidence in the solution being correct (+4). Dangerous knowledge is modeled as a particular combination of the two preceding variables: if a submitted solution is wrong, yet the learner has high confidence that it is correct, we treat this as a case of dangerous knowledge. The upper third of the confidence scale (2, 3, 4) was considered high confidence in the answer being correct.

To construct and evaluate the models, we split the dataset into a 70% training and a 30% validation set, stratified by the target response to predict. We used multivariate adaptive regression splines (MARS) [5] for classification. Additive MARS models are piecewise-linear functions of the form

m(x) = \sum_{i=1}^{n} c_i h_i(x) + c_0,

where c_0 is the intercept, the c_i are constant coefficients, and the h_i are hinge functions. These h_i take the form h_i(x) = max(0, s_i (x - k_i)), s_i ∈ {-1, +1}. They are used to partition the data; for instance, h(x - 5) is only non-zero when x > 5 (lower bound), while h(5 - x) is only non-zero for x < 5 (upper bound).

An interesting feature of the MARS algorithm is that it automatically selects the features important for the model, which makes it a suitable choice for the exploratory analysis described in this paper. All MARS models were built as additive models, as we wanted to explore the impact of individual parameters. Another strength is that MARS models are more flexible than linear regression models but still relatively easy to interpret (e.g., in contrast to support vector machines).

For each target response, we trained two MARS models: one using only basic data accessible without STEPS, and one additionally including data specific to the usage of STEPS. The basic set of predictors included the following:

• pretest score
• intrinsic motivation
• perceived competence
• time spent on the exercise

Figure 3: Screenshot of a STEPS exercise with confidence input

We wanted to explore the impact of the raw fine-grained information generated by STEPS on the models. Hence, the STEPS set extended the basic set by these simple additional attributes:

• number of lines used in the solution
• number of inputs provided
• number of changes to previously typed inputs

To compare the relative effectiveness of the trained models, we built a receiver-operating characteristic (ROC) curve for the observed values in the validation set against the predicted values for each model. The area under the ROC curve can be used to compare the predictive power of models [4] (as cited in [2]).
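The AUC comparison can be made concrete with a small sketch. AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one (the Mann-Whitney formulation); the labels and model scores below are made up for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney statistic.

    labels: 0/1 observed outcomes; scores: model predictions.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation-set outcomes and predictions of two models.
y_true      = [1, 1, 1, 0, 0, 1, 0, 0]
base_model  = [0.9, 0.7, 0.4, 0.6, 0.3, 0.8, 0.5, 0.2]
steps_model = [0.9, 0.8, 0.6, 0.7, 0.3, 0.8, 0.4, 0.2]

print(auc(y_true, base_model), auc(y_true, steps_model))  # 0.875 0.9375
```

A higher AUC means the model ranks positive cases above negative ones more often; 0.5 corresponds to chance-level ranking.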

4. RESULTS

We trained three pairs of models, one pair for each target response. In this section, we present those models together with measures of their predictive power. Models are written as M_s^t, where t is the target predicted value (perf, conf, or dangerous) and s is the set of predictors used (base or steps). Predictor abbreviations are detailed in Table 1. Coefficients are rounded to 3-digit precision. The models we obtained are presented below.

Performance:

M_base^perf = 0.805 + 0.325 · h(score − 1.5) − 0.171 · h(2 − score) − 0.007 · h(97 − t)

M_steps^perf = 0.658 + 0.400 · h(score − 2) − 0.226 · h(2 − score) + 0.319 · h(numLines − 4) + 0.036 · h(numEdits − 3) + 0.023 · h(t − 80) − 0.04 · h(t − 97) − 0.004 · h(97 − t) + 0.015 · h(t − 147)
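Written out explicitly, the two fitted performance models are just sums of hinge terms and can be evaluated in a few lines of Python. This is a sketch with the coefficients taken from the models above; the units of t follow whatever scale the models were fitted on, and the example input values are made up.

```python
def h(x):
    """Hinge function: max(0, x)."""
    return max(0.0, x)

def m_base_perf(score, t):
    return (0.805 + 0.325 * h(score - 1.5)
                  - 0.171 * h(2 - score)
                  - 0.007 * h(97 - t))

def m_steps_perf(score, t, num_lines, num_edits):
    return (0.658 + 0.400 * h(score - 2)
                  - 0.226 * h(2 - score)
                  + 0.319 * h(num_lines - 4)
                  + 0.036 * h(num_edits - 3)
                  + 0.023 * h(t - 80)
                  - 0.040 * h(t - 97)
                  - 0.004 * h(97 - t)
                  + 0.015 * h(t - 147))

# A hypothetical learner with a high pre-test score:
print(m_base_perf(score=3, t=100))  # 0.805 + 0.325 * 1.5 = 1.2925
```

The raw model output is a regression score; for classification it is compared against a decision threshold.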

Table 1: Collected data attributes

Learner attributes
| Name                 | Description                                         | Abbreviation |
| Learner Id           |                                                     |              |
| Pretest              | The learner's score on the pretest                  | score        |
| Intrinsic Motivation | The learner's intrinsic motivation on a 0-100 scale | IA           |
| Perceived Competence | The learner's perceived competence on a 0-100 scale | POC          |

Basic attributes
| Name                | Description                                | Abbreviation |
| Exercise Id         |                                            |              |
| Duration            | Time spent on the exercise in milliseconds | t            |
| Response confidence | Discrete scale from -4 to +4               |              |
| Solved              | true if the exercise was solved correctly  |              |

STEPS attributes
| Name             | Description                                               | Abbreviation |
| Strategy         | List of chosen operators, e.g., expand, add               |              |
| Number of Lines  | Number of lines of the learner's solution                 | numLines     |
| Number of Inputs | Number of learner's inputs, e.g., filling an empty blank  | numInputs    |
| Number of Edits  | Number of changes to blanks already filled by the learner | numEdits     |

Confidence:

M_base^conf = 0.697 + 0.004 · h(POC − 26.667)

M_steps^conf = 0.767 + 0.003 · h(POC − 26.667) + 0.651 · h(2 − numLines) − 0.155 · h(8 − numInputs) + 0.144 · h(3 − numEdits)

Dangerous Knowledge:

M_base^dangerous = M_steps^dangerous = 0.396 − 0.741 · h(score − 1) + 0.626 · h(score − 1.5) + 0.044 · h(95 − t)

Evaluation of the predictive power using receiver-operating characteristic (ROC) curves shows that for each target response, the inclusion of STEPS-specific features leads to models with a higher area under the curve (AUC), although the increase is rather marginal. Table 2 lists the AUC values for the preliminary models along with the p-values of the one-tailed DeLong test used to check whether the AUC of the STEPS model is significantly higher than the AUC of the base model.
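The binary target variables behind these models follow directly from the definitions in Section 3; a minimal sketch of the labeling:

```python
def targets(solved, confidence):
    """Derive the three binary targets from the raw observations.

    solved:     whether the exercise was answered correctly
    confidence: self-reported score on the -4..+4 scale; the upper third
                of the scale (2, 3, 4) counts as high confidence
    """
    high_conf = confidence >= 2
    dangerous = (not solved) and high_conf
    return {"perf": solved, "conf": high_conf, "dangerous": dangerous}

# A wrong answer given with high confidence is a dangerous-knowledge case:
print(targets(solved=False, confidence=3))
# {'perf': False, 'conf': True, 'dangerous': True}
```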

5. DISCUSSION AND FUTURE WORK

Table 2 shows that the performance and confidence models including STEPS data perform marginally better than their basic counterparts. For dangerous knowledge, the MARS algorithm generates the same model in both conditions. In what follows, we describe each of the models in more detail.

Performance. The model for performance including the STEPS parameters has only a marginally higher AUC (.8811) than the base model (.8800); this difference is not significant.

Figure 4: ROC curves for performance models (base model is dashed)

Table 2: AUC for preliminary models

| Target              | Predictors | AUC    |
| Performance         | base       | .8800  |
|                     | STEPS      | .8811  |
|                     | p (DeLong) | .4915  |
| Confidence          | base       | .6289  |
|                     | STEPS      | .6861  |
|                     | p (DeLong) | .0686  |
| Dangerous Knowledge | base       | .6537  |
|                     | STEPS      | .6537  |
|                     | p (DeLong) | .5     |

However, it is worth mentioning that the base model's AUC of .8800 is already fairly high, which results in a ceiling effect. A closer analysis shows that in both models the pre-test score is the main factor: the positive coefficients of the terms h(score − 1.5) in M_base^perf and h(score − 2) in M_steps^perf indicate that a high pre-test score is a good predictor of a relevant post-test exercise being solved. This was expected; since the models predict whether the learner will solve an exercise, it is natural that they rely strongly on the measure characterizing this learner's exercise-solving history.

Figure 5: ROC curves for confidence models (base model is dashed)

The STEPS-enriched model suggests that the number of lines in the solution is positively associated with successfully solving the exercises, starting from the fifth line. Editing activity is also considered to have a positive impact. The model also includes four terms related to timing information (with both positive and negative coefficients). This suggests that time-on-task is a parameter that deserves further attention; using STEPS data, more specific time-based information can be derived. The work by Jarušek and Pelánek, e.g., using logarithmic scaling and group invariants [8], provides a starting point for identifying methods applicable to the fine-grained timing accessible through STEPS. Even though the additional STEPS data brings about only a marginal improvement in modeling quality, the STEPS data we use so far (such as the number of lines and edits) are only raw characteristics of a submitted solution. We expect that enriching them with more meaningful information characterizing the knowledge-based dimension of the learner solutions (e.g., demonstrating a misconception, systematically failing to apply a prerequisite concept, choosing an ineffective solution path, etc.) should result in more significant gains in the predictive power of STEPS-using models.

Confidence. Both confidence models have rather low AUC values. However, the STEPS-enhanced version surpasses the base version by more than 0.05 in terms of AUC. This difference is marginally significant with p = .0686. The base model includes only the intercept and the POC parameter with a low coefficient as a predictor. The coefficient is so small that the impact of POC vanishes; high confidence is constantly predicted in all cases. In contrast, the STEPS model additionally includes all three parameters provided by STEPS: number of lines, inputs, and edits. Single-line solutions have a strong positive impact on the prediction (+0.651). A small number of edits is positively associated with confidence, while a small number of inputs has a negative impact.

Figure 6: ROC curves for dangerous knowledge models (base and STEPS models are identical)

M_steps^conf reflects an intuitive judgement: complete and terse solutions requiring few corrective interventions are likely to be accompanied by a high degree of confidence. However, solutions for which learners reported uncertainty or confidence in the answer being wrong were often short and complete as well. With no additional information on the characteristics of the lines, the MARS algorithm cannot produce models that discriminate between these two cases, and thus leans towards the most frequent outcome: in almost 75% of the cases, learners reported high confidence in the correctness of the answer. We believe that with additional information, such as the correctness of a step or the outcome of an editing action, the quality of modeling could be substantially improved. To this end, we plan to enhance STEPS with tools for advanced user input interpretation, such as the IDEAS domain reasoners [6].

Dangerous Knowledge. Since dangerous knowledge is a combination of the two other target variables, we expected its model to rely on STEPS information. Nonetheless, the result was surprising: the MARS algorithm selected only parameters from the basic set for both models, suggesting that the raw STEPS information is useless in predicting cases of dangerous knowledge. A possible reason for this is that M_steps^perf and M_steps^conf make opposite use of the number of lines and edits, while the number of inputs has little to no impact. This once again emphasizes the need for a more meaningful interpretation of STEPS data beyond trivial activity-based information.

These preliminary findings confirm that the inclusion of data specific to STEPS can sometimes help build better models for interesting learning phenomena. However, the predictive power of the obtained models is not sufficient to use them for automated pedagogical decision-making. We plan to continue working on this project in the following directions:

• fine-grained timing analysis;
• usage of aggregate values derived from STEPS attributes;
• use of external tools to automatically annotate STEPS input with correctness information.

We expect these to be promising paths to further improve the predictive power of the models.

6. REFERENCES

[1] G. Corbalan, F. Paas, and H. Cuypers. Computer-based feedback in linear algebra: Effects on transfer performance and motivation. Computers & Education, 55(2):692–703, Sept. 2010.
[2] M. C. Desmarais and R. S. J. d. Baker. A review of recent advances in learner and skill modeling in intelligent learning environments. User Modeling and User-Adapted Interaction, 22(1-2):9–38, Oct. 2011.
[3] A. Eichelmann, S. Narciss, L. Schnaubert, E. Melis, and G. Goguadze. Design und Evaluation von interaktiven webbasierten Bruchrechenaufgaben. DeLFI 2011 – Die 9. e-Learning Fachtagung Informatik, pages 31–42, June 2011.
[4] J. Fogarty, R. S. Baker, and S. E. Hudson. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. pages 129–136, 2005.
[5] J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, pages 1–67, 1991.
[6] A. Gerdes, B. Heeren, J. Jeuring, and S. Stuurman. Feedback services for exercise assistants. Proceedings of the 7th European Conference on e-Learning, pages 402–410, 2008.
[7] V. Jans and D. Leclercq. Mesurer l'effet de l'apprentissage à l'aide de l'analyse spectrale des performances. In C. Depover, editor, L'évaluation des compétences et des processus cognitifs : modèles, pratiques et contextes, pages 303–317. De Boeck & Larcier, 1999.
[8] P. Jarušek and R. Pelánek. Analysis of a simple model of problem solving times. Intelligent Tutoring Systems, pages 379–388, 2012.
[9] E. Melis, G. Goguadze, and P. Libbrecht. ActiveMath – a learning platform with Semantic Web features. The Future of Learning, 2009.
[10] L. Schnaubert, E. Andrès, S. Narciss, A. Eichelmann, G. Goguadze, and E. Melis. Student behavior in error-correction-tasks and its relation to perception of competence. In C. Delgado Kloos, D. Gillet, R. Crespo Garcia, F. Wild, and M. Wolpers, editors, Towards Ubiquitous Learning, Proceedings of the 6th European Conference on Technology Enhanced Learning, EC-TEL 2011, pages 370–383, 2011.
[11] L. Schnaubert, E. Andrès, S. Narciss, S. Sosnovsky, A. Eichelmann, and G. Goguadze. Using local and global self-evaluations to predict students' problem solving behaviour. 21st Century Learning for 21st Century Skills, pages 334–347, 2012.
[12] K. VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4):197–221, Oct. 2011.