Estimating Programming Knowledge with Bayesian Knowledge Tracing Jussi Kasurinen, Uolevi Nikula Lappeenranta University of Technology P.O. Box 20 FI-53851 Lappeenranta +358 5621 6641
jussi.kasurinen,
[email protected] ABSTRACT In this paper we present a concept for a three-phase measurement method that can be used to obtain data on student learning. The focus of the method lies on the technical aspects of learning programming, answering questions such as which programming constructs students applied and how large a portion of the students understood the concepts of the programming language. The model is based on three consecutive measurements, which are used to observe the student errors and the applied programming structures, and to determine programming knowledge by applying a Bayesian learning model. So far the model has produced results which confirm prior knowledge on student learning, indicating that the concept is feasible for further development. Despite the early development phase of the method, it offers a straightforward way for a teacher to assess the course contents and student performance.
Categories and Subject Descriptors K.3.2 [Computer and Information Science Education]: Computer Science education
General Terms Algorithms, Measurement, Performance
Keywords Programming knowledge, error analysis, knowledge tracing
1. INTRODUCTION Programming as a discipline is difficult to learn and apply successfully, as it has several concepts and structural rules which have to be understood before anything relevant can be accomplished. In fact, the common consensus is that mastering the field takes at least ten years of professional experience [18], and even then the expertise is largely limited to a certain group of programming languages and approaches.
For a novice programmer, the learning process usually starts with the basic structures such as iteration, variables and logical expressions, which are then extended to functions, modules and objects. After learning the structures, the students are taught to combine these “building blocks” to create functional frameworks. These frameworks then grow more and more complex, until larger functionalities and more complicated programs can be created. However, as the complexity of the programming assignments increases, the student submissions start to indicate an interesting phenomenon [3]: as most programming languages offer several redundant structures to implement functionalities like iteration, the students seem to favor some structures over others, even to the point where personal preferences in program implementation supersede practicality [3,16]. Besides practical issues, the programming habits may have undesirable characteristics [12]: files are left open, exception handlers are missing, and in general shortcuts are taken to create the desired outcome without any regard for good programming practice.
To understand these issues better, our research collected and analyzed student performance data. Our objective was to measure which concepts were difficult for the students, how well the different programming structures were understood and, in general, to trace student learning in the fundamental programming course. For these objectives, we created a three-phase analysis method. In the first phase, the student errors were analyzed statistically to establish a basic understanding of the problematic programming concepts. In the second phase, the programming structures applied in the student submissions were identified and statistically analyzed to observe which structures students commonly used. In the third phase, we applied the Bayesian Knowledge Tracing algorithm [2] to the structure data to model the learning process and measure the final programming knowledge. The process also makes it possible to monitor student performance during the course, enabling the teacher to notice difficulties and adjust the course contents accordingly.
The rest of the paper is structured as follows. Section 2 discusses prior research on similar topics, and Section 3 introduces the research and data collection methods applied in this study. Section 4 presents the case study, including the data, the data analysis and the results. Section 5 discusses the results and the limitations of the method, and Section 6 closes the paper with conclusions.
2. RELATED RESEARCH The structural analysis of novice programmer code has been studied on several earlier occasions. For example, Soloway et al. [16] discuss the concept of different levels of programming plans and frameworks. Students learn to program by creating mental models to solve different problems and by combining these models to implement larger functionalities. The study also suggests that students have a tendency not to recognize the problem they are actually trying to solve, causing them to apply ill-suited structures. Students also seem to avoid unfamiliar concepts, even discarding some of the fundamental structures. Spohrer et al. [17] discuss a similar scheme, in which the student programs are deconstructed into a model of different programming constructs. The model represents the solution abstractly, without programming language-specific features. A similar feature extraction scheme has later been used by Carter et al. [5] and Mierle et al. [14], who assessed student knowledge by analyzing the applied programming constructs. In the context of learning, Linn [11] defines the cognitive accomplishments of learning programming. By this definition, the first accomplishment is learning the basic structures, while the second is the ability to solve abstract problems. On a larger scale, Bloom’s taxonomy [4], for example, defines the overall learning process as a group of phases in which the student applies the knowledge to achieve different objectives, such as comprehension, application and analysis skills. The student understanding of programming has been discussed in several recent studies. A review of these studies by Simon et al. [15] points out that student understanding has been measured in several ways, including writing, reading or tracing source code. The same study also discusses the relevance of the first impression in programming.
If the threshold of connecting the previous understanding to the new programming constructs is too high, it hinders the learning process and should be actively avoided. In addition to structural analysis, the measurement of actual programming knowledge has been studied. The Bayesian Knowledge Tracing (BKT) algorithm is an estimation algorithm that has been applied in several studies [2,6,7,9] to model the student learning process. In particular, Jastrzembski et al. [9] discuss the possibilities of using a Bayesian approach to predict student performance and steer the training process to enhance learning. In general, the BKT model featured in Figure 1 applies prior knowledge (Ln) and the probability of learning the applied concept (p(T)) to measure the progress of student learning. During the iterations, the algorithm also models the possibility of guessing the answer (p(G)) or of slipping (p(S)), which means making an error despite understanding the concept.
3. ANALYSIS MODEL The test data for our research was collected from the Fundamentals of Programming course held in the fall of 2007. To establish an understanding of the student group, prior programming experience was surveyed at the beginning of the course with a starting survey. The data collection was conducted with a virtual learning environment. The students implemented their submissions with local programming tools, and then tested, finalized and submitted their programming exercises to the virtual learning environment. The virtual learning environment recorded the errors encountered during the testing and finalizing processes, and stored the final, accepted submission in the database.
[Figure 2 diagram. Data collection: 1. Collection, 2. Filtering. Analysis phases: 3. Error analysis (for the error log); 4. Structure analysis and 5. BKT modeling (for the submission log).]
Figure 2: Data collection and analysis phases
As shown in Figure 2, the actual analysis consisted of three phases. First, the error log was statistically analyzed for recurring errors or unusual distributions of error types. Second, the submission database entries were filtered based on the applied structures, with a method similar to that of Mierle et al. [14]. This enabled a statistical comparison of the different types of structures applied in different exercises, as in the structure comparison of Soloway et al. [16]. Third, the BKT model was applied to the filtered data to estimate the learning probability for each student separately. This phase was based on the suggestions and notions of Jastrzembski et al. [9] and Baker et al. [2].
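The structure filtering described above could be approximated with Python's standard `ast` module. The following is only a minimal sketch, not the tooling used in the study: the `STRUCTURES` mapping, the function name `applied_structures` and the sample submission are all hypothetical, and the sketch assumes that the submission parses under the interpreter running it (the course submissions used Python 2 constructs such as raw_input, which a Python 3 parser accepts only where the syntax is still valid).

```python
import ast
from collections import Counter

# Hypothetical mapping from AST node types to the structure names used in this paper.
STRUCTURES = {
    ast.For: "for",
    ast.While: "while",
    ast.Try: "exception",
    ast.FunctionDef: "function",
}


def applied_structures(source):
    """Count the structures of interest in the source text of one submission."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        name = STRUCTURES.get(type(node))
        if name:
            counts[name] += 1
        elif isinstance(node, ast.Call):
            func = node.func
            # open(...) called by name, and any <object>.close() method call
            if isinstance(func, ast.Name) and func.id == "open":
                counts["open"] += 1
            elif isinstance(func, ast.Attribute) and func.attr == "close":
                counts["close"] += 1
    return counts


# Hypothetical submission, written only for illustration.
sample = '''
def read_lines(path):
    try:
        handle = open(path)
    except IOError:
        return []
    lines = []
    for line in handle:
        lines.append(line)
    handle.close()
    return lines
'''

print(applied_structures(sample))
```

Comparing the "open" and "close" counts gives a rough indication of whether opened files were closed; it does not prove that the same file object is closed on every execution path.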
The Bayesian Knowledge Tracing algorithm is also discussed by Gunzelmann and Gluck [8]. They acknowledge that BKT algorithms have a “history of success” in programming and algebra. Their study also criticizes the model for not implementing forgetting, and discusses the learning model by Anderson and Schunn [1]. However, this ACT-R (Adaptive Control of Thought – Rational) model is a much larger concept, including cognitive psychology and development, and is geared towards long-term learning, skill acquisition and deterioration. In terms of applicability, Gunzelmann and Gluck [8] suggest that the BKT model is sufficient for estimating skill mastery, while ACT-R is better suited for “advanced studies” in long-term learning, where forgetting and experience are major components.
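[Figure 1 diagram. States: Unlearned, Learned. Transition labels: p(L0), 1-p(L0), p(T). Outcomes: Valid, Invalid, with probabilities p(G) and p(S).]
Figure 1: Bayesian Knowledge Tracing diagram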
4. CASE STUDY 4.1 Research Process After the data collection was complete, the database contained data from 133 students, with a total of 2912 accepted Python source codes and 9420 logged errors. The simplicity of some of the exercises prevented us from using all of the accepted submissions. Out of the 41 assignments, only 12 were considered complex enough to qualify for the analysis. These exercises were from the latter parts of the course, with 40-50 lines of code (LOC), enabling students to pursue individual strategies to solve the presented programming problem. We also rejected five exercises which were based on our visualization tool Turtlet [10], since these exercises had detailed operational descriptions and pseudocode presentations, making them too refined for our study. In the end, the structure analysis was conducted with 675 source codes, whereas the error analysis was still conducted on the full database of 9420 errors, collected from all of the source code submissions. For the BKT modeling, the structures were compared to the preferred solutions defined by our research team. The preferred solutions were designed with the following guidelines:
The Ln value after the last iteration represents the probability that the student has learned the tested structure. The confidence level for a student mastering (i.e. understanding) the structure was set to 0.95, based on the paper by Corbett and Anderson [7]. This roughly translates to requiring three valid uses of the tested structure in a set of five exercises before the student is classified as having mastered the subject.
4.2 Study Results The error analysis was used to examine whether any consistent error type could be associated with a certain type of programming structure. A significant accumulation of a certain type of error could have explained why students avoided some programming structures.
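As an illustration of this kind of error tally, the sketch below extracts the exception class name from interpreter stderr output and computes the share of a group of error types. It is a minimal sketch under stated assumptions: the regular expression, the function names and the three sample log entries are hypothetical and are not taken from the course data or from the learning environment.

```python
import re
from collections import Counter

# Matches a line that starts with an exception class name ending in
# "Error" or "Exception", e.g. "NameError: name 'x' is not defined".
ERROR_LINE = re.compile(r"^(\w+(?:Error|Exception))\b", re.MULTILINE)


def error_type(stderr_text):
    """Return the exception class named on the last matching line, or None."""
    matches = ERROR_LINE.findall(stderr_text)
    return matches[-1] if matches else None


def tally(stderr_logs):
    """Count error types over a collection of stderr outputs."""
    return Counter(t for t in map(error_type, stderr_logs) if t)


def share(counts, types):
    """Fraction of all counted errors that belong to the given types."""
    total = sum(counts.values())
    return sum(counts[t] for t in types) / total if total else 0.0


# Hypothetical log entries, written only for illustration.
logs = [
    'Traceback (most recent call last):\n  File "a.py", line 1, in <module>\n'
    "    print(x)\nNameError: name 'x' is not defined\n",
    '  File "b.py", line 2\n    print(1)\n    ^\n'
    "IndentationError: unexpected indent\n",
    'Traceback (most recent call last):\n  File "c.py", line 1, in <module>\n'
    "    int('a')\nValueError: invalid literal for int() with base 10: 'a'\n",
]

counts = tally(logs)
syntax_related = ("SyntaxError", "NameError", "IndentationError")
print(counts, share(counts, syntax_related))
```

Exceptions whose class names do not end in "Error" or "Exception" (for example KeyboardInterrupt) are ignored by this pattern, which is a simplification.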
The BKT algorithm calculates the probability that a student has learned a certain programming structure. If a structure, for example the exception handler, was expected to be applied in five exercises and the student used it correctly only in the first and the fourth, this would have led to an analysis string of “10010”. The new learning probability L_n was then calculated with the Bayesian probability equation (1), where the conditional term was taken from (2) if the structure was used correctly (1) and from (3) if it was used incorrectly or not implemented (0). Here R_n denotes the result observed in exercise n.

$$L_n = p(L_{n-1} \mid R_n) + \bigl(1 - p(L_{n-1} \mid R_n)\bigr)\, p(T) \tag{1}$$

$$\text{if 1:}\quad p(L_{n-1} \mid R_n) = \frac{\bigl(1 - p(S)\bigr)\, p(L_{n-1})}{\bigl(1 - p(S)\bigr)\, p(L_{n-1}) + p(G)\,\bigl(1 - p(L_{n-1})\bigr)} \tag{2}$$

$$\text{if 0:}\quad p(L_{n-1} \mid R_n) = \frac{p(S)\, p(L_{n-1})}{p(S)\, p(L_{n-1}) + \bigl(1 - p(G)\bigr)\,\bigl(1 - p(L_{n-1})\bigr)} \tag{3}$$
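The update can be written compactly in code. The following is a minimal sketch of the standard knowledge tracing update, not the implementation used in the study; the function names are hypothetical, and the default parameter values (p(G) = 0.30, p(S) = 0.10, p(T) = 0.70, L0 = 0.35 and the 0.95 mastery threshold) are the ones reported in this paper.

```python
def bkt_update(prior, correct, p_guess=0.30, p_slip=0.10, p_learn=0.70):
    """One knowledge tracing step: condition on the observation, then apply learning."""
    if correct:
        # Structure used correctly: known and not slipped, versus guessed.
        evidence = (1 - p_slip) * prior
        posterior = evidence / (evidence + p_guess * (1 - prior))
    else:
        # Structure used incorrectly or missing: slipped, versus unknown and not guessed.
        evidence = p_slip * prior
        posterior = evidence / (evidence + (1 - p_guess) * (1 - prior))
    # Chance of learning the structure during this exercise.
    return posterior + (1 - posterior) * p_learn


def trace(analysis_string, l0=0.35, **params):
    """Run the update over an analysis string such as "10010"."""
    estimate = l0
    for mark in analysis_string:
        estimate = bkt_update(estimate, mark == "1", **params)
    return estimate


def mastered(analysis_string, threshold=0.95, **params):
    """True if the final estimate reaches the mastery threshold."""
    return trace(analysis_string, **params) >= threshold


print(trace("10010"))
```

Passing the parameters as keyword arguments keeps the sketch usable for experimenting with other values, for example per-student estimates of guess and slip.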
[Bar chart: number of errors per error type. Recovered bar values: 4270, 1536, 1193, 659, 413, 357, 335, 311, 183, 155, 8. Recovered error type labels: Syntax, Name, Indentation, Type, Attribute, Index, Value, IO, Unbound, Import, 0div.]
Figure 3: Total amount of errors by type encountered while testing and finalizing
As Figure 3 illustrates, the error types related to different constructs (type, index, attribute, IO, unbound and value) do not differ significantly, while the syntax-related errors (syntax, name and indentation) cause the majority (68.6%) of the errors. This does not give us any information on the programming constructs, but it is a clear indication that the most common problem for students was expressing themselves with the programming language. These results established a basic understanding of the programming difficulties and indicated that the structural analysis was needed to analyze the programming knowledge. The structural analysis was then used to analyze which types of programming structures the students applied. The first analyzed structure was the file handling procedure. The procedure is a fairly simple two-part operation, of which the latter part, closing the file, is one of the most easily forgotten.
Based on the prior study by Baker et al. [2], the algorithm parameters were set to p(G) = 0.30 for guessing the right answer and p(S) = 0.10 for making a mistake (slipping). Accordingly, p(T) = 1 - p(G) = 0.7, meaning that when a student applies the correct structure there is a 70% chance that the student has understood (learned) how to apply it correctly. Additionally, the initial value L0, the probability that the student understands the concept before the first iteration, was derived from the starting survey data: a student with previous programming experience was considered to have sufficient prior knowledge, and on this basis L0 was set to 0.35.
• While should be used when the number of iterations is not known beforehand, otherwise for.
• Every input should be taken as a raw_input and converted to a suitable type.
• Each file opening to read a file should have an exception handler.
• Each open file should be closed.
• Functions (def) should be used when implementing a recurring action outside an iterative structure.
As the Python programming language has only one structure for logical expressions (if-elif-else), it was discarded from the test. Similarly, recursion was not tested as it was not included in the course contents.
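To make the guidelines concrete, the sketch below shows what a solution following them might look like. It is an illustration only, not one of the preferred solutions used in the study: the function names and the prompt text are hypothetical, and the sketch is written for Python 3, where input takes the place of the raw_input mentioned in the guidelines.

```python
def read_numbers(path):
    """Read one integer per line from a text file."""
    try:
        handle = open(path)        # file opening guarded by an exception handler
    except IOError:
        return []
    numbers = []
    for line in handle:            # for: the sequence to iterate over is known
        numbers.append(int(line))  # text converted to a suitable type
    handle.close()                 # each opened file is closed
    return numbers


def ask_positive(read=input):
    """Ask repeatedly until a positive integer is given."""
    value = int(read("Give a positive number: "))
    while value <= 0:              # while: number of repetitions not known beforehand
        value = int(read("Give a positive number: "))
    return value
```

The explicit close call mirrors the guideline that was measured; in current Python a with statement would usually be preferred, since it also closes the file when the conversion raises an error. The read parameter of ask_positive is an assumption added so that the function can be exercised without a keyboard.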
[Bar chart: % of submissions per Week/Assignment. Recovered bar values: 92, 95, 100, 94, 92, 48, 41. Week/Assignment labels: 10/1, 10/2, 11/1, 11/2, 12/2, 13/1, 13/2.]
Figure 4: Amount of students who closed opened files
Figure 4 illustrates the file handling procedure results, which suggest that the overall structure was understood fairly well, although in two assignments the closing was often forgotten. The first error-prone exercise was week 11 exercise 1 (11/1 in Figure 4), which had an existing template on which the students had to implement functionalities. The second, 12/2, was the only exercise during that week that applied file handling, suggesting that the students were not yet accustomed to the file handling operations. Although these results are excellent, with a maximum of 100% and an average of 80.3%, the second analyzed structure, exception handling, proved to be more troublesome.
[Bar chart: % of submissions per Week/Assignment. Recovered bar values: 73, 71, 54, 23, 22, 5, 8, 10. Week/Assignment labels: 10/2, 11/1, 11/2, 12/2, 13/1, 13/2, 13/3, 14/2.]
Figure 5: Students who implemented exception handlers
As Figure 5 shows, most of the students left the exception handling routine out of their solutions. Manual confirmation of this result also indicated that the students avoided exception handling, usually circumventing it with logical expressions. It should also be mentioned that the submissions were not tested with faulty inputs, as such tests would have required the students to implement exception handling and would have biased this measurement. The iteration structures were also analyzed, as they are among the most fundamental structures in programming. Based on the analysis data, the two iteration structures in Python, for and while, were applied casually; the similarity of the results also indicated that the students mostly understood where each should be applied, as only on one occasion, in week 13, did the majority differ from the expected solution. Another interesting observation was made using the structural analysis. In some exercises (13/3 and 14/2) the total amount of iterative structures did not add up to 100% or over. This was caused by an unexpected practice: some students had created a custom iteration type from an exception handler and a recursive function call to enforce input validation. This iteration type was found in 27 source codes, created by 22 different students.
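The custom iteration type can be illustrated as follows. This is a reconstruction of what such a construct might look like, not actual student code: the function names and the prompt are hypothetical, the sketch is written for Python 3, and the read parameter is an assumption added so that the functions can be run without keyboard input.

```python
def ask_int_recursive(read=input):
    """Input validation with an exception handler and a recursive call."""
    try:
        return int(read("Give a number: "))
    except ValueError:
        # The recursive call takes the place of an iteration structure.
        return ask_int_recursive(read)


def ask_int_loop(read=input):
    """The same validation written with the while structure."""
    while True:
        try:
            return int(read("Give a number: "))
        except ValueError:
            continue
```

Both functions return the first input that converts to an integer. The recursive variant contains neither for nor while, which is why a count of iteration structures alone would miss it.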
The BKT modeling was used to calculate the knowledge estimates based on the student submissions. Figure 6 presents the percentage of the student population exceeding the original confidence level of 95% and, for comparison, the 90% confidence level. Based on the analysis, approximately half of the students understood all the topics within the course syllabus, even though they rarely applied some of the more complex structures.
[Bar chart: % of students per structure for two thresholds. Recovered value pairs: 58/63, 48/62, 67/67, 46/48, 49/52. Structures: While, For, Functions, Exceptions, Files. Series: Threshold p = 0.05, Threshold p = 0.1.]
Figure 6: BKT-test for analyzed structures
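The percentages of this kind are simple threshold counts over the final per-student estimates. The sketch below shows the computation under stated assumptions: the function name is hypothetical and the four estimates are invented for illustration, not values from the study.

```python
def share_above(estimates, threshold):
    """Percentage of students whose final estimate reaches the threshold."""
    if not estimates:
        return 0.0
    return 100.0 * sum(e >= threshold for e in estimates) / len(estimates)


# Hypothetical final estimates for four students and one structure.
final_estimates = [0.97, 0.93, 0.88, 0.99]
for threshold in (0.95, 0.90):
    print(threshold, share_above(final_estimates, threshold))
```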
5. DISCUSSION Obviously the results are subject to critical evaluation. Bertels et al. [3] describe several concerns, or clichés, which apply to the structural analysis. For example, in some assignments the preferred structures are less obvious than in others, meaning that the desired approach may be beyond the scope of novice students. Similarly, as the complexity of the required programming tasks increases, the individual differences in learning abilities, prior experience and desired future reusability of the code become a concern. Finally, prior experience in programming also affects the approach to problem solving. For example, the function as a structure is a useful tool, but novice programmers may not see the benefit of applying it as often as more experienced programmers would. Especially if there is an easier, possibly more laborious, solution to the problem, novice programmers may well use it. This mindset leads to solutions where several easy, well-understood structures are combined instead of applying the complex but direct problem-solving structure. The validity of the error analysis depends on the accuracy of the Python interpreter, as the errors are classified by the report obtained from the stderr output. The Python interpreter seems to be reasonably accurate in error classification, but there is no research data on this subject. However, the error analysis is used only to establish a basic understanding and observe trends, not for detailed analysis. The parameters selected for the BKT analysis modeled the student population as one homogeneous group, disregarding the individual differences mentioned in some of the earlier studies [6,7]. Although it is realistic to believe that the BKT analysis underestimated some individuals, the model concept is sufficient for estimating the general understanding. In addition, the model can be criticized because it does not take the forgetting of already learned structures into account.
However, the measured time span is 14 consecutive weeks, so the probability of forgetting something actually understood is not very high. Therefore the BKT algorithm did suit the analysis, although its applicability to a longitudinal study is limited. From the point of view of Bloom’s taxonomy [4], the student abilities were mostly related to the knowledge and comprehension levels. In some cases, the analysis indicated that some students had the ability to apply given data to a new context, but this behavior was not consistent. As for the structure selections, the majority of students seemed to avoid advanced structures like exceptions and functions, as these were rarely applied if they were not crucial for the exercise, but in the first programming course this is understandable. The practical results from our three-phase analysis method support the findings of previous studies. Students have difficulties writing source code, and they do have problems with the more complex programming structures. For the analysis method, this proof-of-concept model therefore seems sensible. Obviously the results produced by the method still require additional research for validation, but as a concept this model enables us to enhance the exercises and bring focus to the problems students struggle with, while simultaneously observing the technical soundness of the exercises and the overall progress.
6. CONCLUSION The student source code submissions and error logs were analyzed to investigate programming knowledge after the first course on programming. The results indicated that several students had limited programming knowledge and, to a lesser degree, problems in expressing themselves with the programming language. Based on the data from which this analysis was created, it is clear that the students do not sufficiently understand the programming structures. To address this issue, the results of this analysis can be used to revise the course assignments and track student performance more closely. For now, the concept for the analysis seems reasonable, giving the teacher more insight into student progress and problem areas. We are satisfied with the proof-of-concept results gained from applying the method to the existing data. However, the analysis still needs additional research, and we need to adjust our exercises to better suit the BKT analysis. By reconsidering the assignments with the BKT analysis in mind, we can check that each structure of interest is repeated often enough in the course assignments to obtain reliable data to interpret and analyze. In addition, instructions on when to apply different structures, or how to solve certain problems, should be promoted in the course contents. The motivational impact of enforcing certain structures instead of encouraging self-designed implementations should also be considered. For future work, our concept needs tuning and additional testing for validation. The analysis also needs methodological triangulation in relation to similar concepts. The BKT algorithm could also be developed further to support individual differences. One long-term goal could be to integrate the analysis tool into an existing learning environment, using the results to offer students individually tailored assignments.
7. REFERENCES
[1] Anderson, J.R. and Schunn, C.D., 2000. Implications of the ACT-R Learning Theory: No Magic Bullets. Advances in instructional psychology, Vol. 5.
[2] Baker, R.S.J., Corbett, A.T. and Aleven, V., 2008. More Accurate Student Modeling through Contextual Estimation of Slip and Guess Probabilities in Bayesian Knowledge Tracing. Proc. of the 9th International Conference on Intelligent Tutoring Systems, pages 406-415, Berlin, Germany.
[3] Bertels, K., Vanneste, P. and De Backer, C., 1993. A cognitive approach to program understanding. Proceedings of Working Conference on Reverse Engineering, 21-23 May, Baltimore, MD, USA, 1-7.
[4] Bloom, B.S., 1956. Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York: David McKay Co Inc.
[5] Carter, J., Ala-Mutka, K., Fuller, U., Dick, M., English, J., Fone, W. and Shread, J., 2003. How shall we assess this? ACM SIGCSE Bulletin Vol 35(4), 107-123.
[6] Conati, C., Gertner, A. and Vanlehn, K., 2002. Using Bayesian Networks to Manage Uncertainty in Student Modelling. User Modeling and User-Adapted Interaction 12(4), pages 371-417.
[7] Corbett, A.T. and Anderson, J.R., 1995. Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User-Adapted Interaction 4(4), pages 253-278.
[8] Gunzelmann, G. and Gluck, K.A., 2004. Knowledge Tracing for Complex Training Applications: Beyond Bayesian Mastery Estimates. Proc. 13th Conference on Behavior Representation in Modeling and Simulation, Arlington, VA, USA, pages 383-384.
[9] Jastrzembski, T.S., Gluck, K.A. and Gunzelmann, G., 2006. Knowledge Tracing and Prediction of Future Trainee Performance. Proceedings of the Interservice/Industry Training, Simulation, and Education Conference, Orlando, Florida, USA.
[10] Kasurinen, J., Purmonen, M. and Nikula, U., 2008. A Study of Visualization in Introductory Programming. Proc. 20th annual Meeting of Psychology of Programming Interest Group, Lancaster, UK.
[11] Linn, M.C., 1985. The cognitive consequences of programming instructions in classrooms. Educational Researcher, Vol 15(5), 14-16, 25-29.
[12] Mayer, R.E., Dyck, J.L. and Vilberg, W., 1986. Learning to program and learning to think: what’s the connection? Communications of the ACM, Vol 29(7), 605-610.
[13] McKeithen, K., Reitman, J.S., Rueter, H.H. and Hirtle, S.C., 1981. Knowledge Organization and Skill Differences in Computer Programmers. Cognitive Psychology 13, pages 307-325.
[14] Mierle, K., Laven, K., Roweis, S. and Wilson, G., 2005. Mining student CVS repositories for performance indicators. Proc. of the 2005 International workshop on Mining Software Repositories, pages 1-5.
[15] Simon, B., Lister, R. and Fincher, S., 2006. Multi-Institutional Computer Science Education Research: A Review of Recent Studies of Novice Understanding. 36th Annual Conference on Frontiers in Education, San Diego, CA, USA, pages 12-17.
[16] Soloway, E., Ehrlich, K., Bonar, J. and Greenspan, J., 1982. What Do Novices Know about Programming? Directions in Human/Computer Interaction, 27-54.
[17] Spohrer, J.C., Pope, E., Lipman, M., Sack, W., Freiman, S., Littman, D., Johnson, L. and Soloway, E., 1985. Bugs in Novice Programs and Misconceptions in Novice Programmers. Computers in Education, 543-552.
[18] Winslow, L.E., 1996. Programming Pedagogy – A Psychological Overview. SIGCSE Bulletin, 28, 17-28.