Computer Science Education Vol. 16, No. 3, September 2006, pp. 229 – 240
On the Use of Resubmissions in Automatic Assessment Systems

Ville Karavirta*, Ari Korhonen and Lauri Malmi

Helsinki University of Technology, Department of Computer Science and Engineering, Finland
Automatic assessment systems generally provide immediate grading of and feedback on learners' submissions. They also allow learners to consider the feedback, revise their solutions, and resubmit. Several strategies exist for implementing the resubmission policy. The ultimate goal, however, is to improve learning outcomes, and thus the strategies should aim at preventing learners from using the resubmission feature irresponsibly. One of the key questions is how to develop the system and its use so as to cut down reiteration that does not seem to be worthwhile. In this paper, we study data gathered from an automatic assessment system that supports resubmissions. We use a clustering technique to distinguish learner groups that differ in their use of the resubmission feature and in the points achieved from the exercises. By comparing these groups with each other, we conclude that a small minority of learners risk using the resubmission feature inefficiently. Some learners seem to resubmit their solution without thinking much between two consecutive submissions. One option for preventing such aimless trial-and-error problem solving is to limit the number of allowed resubmissions. However, not all resubmissions are harmful, and there are several ways to implement the limitations in order to achieve a resubmission policy that suits all students. These are discussed based on the evidence gathered during the research.
1. Introduction

Automatic assessment systems (AA systems) (English & Siviter, 2000; Forsythe & Wirth, 1965; Korhonen & Malmi, 2000; Luck & Joy, 1999; Malmi et al., 2004; Naur, 1964; Rees, 1982; Saikkonen, Malmi, & Korhonen, 2001; Tscherter, Lamprecht, & Nievergelt, 2002) have been used in many institutions to aid grading and to give formative feedback on exercises in introductory computing courses, especially programming courses (Ala-Mutka, 2005b). These tools have many benefits compared with traditional teaching methods. The most cited features are that the learner can work at any place with internet access and submit the assignment at any time. However, there are other pros and cons as well. See, for example, the findings of an international survey of current assessment practices (Carter et al., 2003).
*Corresponding author. E-mail: [email protected]
ISSN 0899-3408 (print)/ISSN 1744-5175 (online)/06/030229-12 © 2006 Taylor & Francis
DOI: 10.1080/08993400600912426
Even though AA systems have been used rather widely, there have been only a few studies on how students use them, how they affect students' learning processes, and how they should be tuned to best support the learning goals of target courses. Examples of such work include Ala-Mutka's study of experiences in using the Style++ tool to check C++ programming style over several courses (Ala-Mutka, 2005a) and Mitrovic's study on different levels of feedback in the SQL tutor (Mitrovic, Martin, & Mayo, 2000). Malmi et al. have presented a longitudinal study on the use of TRAKLA/TRAKLA2 (Korhonen & Malmi, 2000; Malmi et al., 2004) in the years 1993–2004, in which they describe how various changes to the tool functionality and to the course arrangements affected the course results (Malmi, Karavirta, Korhonen, & Nikander, forthcoming). Korhonen et al. present the results of a large-scale intervention study in which the performance of students using TRAKLA on the web was compared with that of students solving similar assignments in classroom sessions (Korhonen, Malmi, Myllyselkä, & Scheinin, 2002). A similar intervention study was repeated with two groups in two universities in 2005 (Laakso, Salakoski, & Korhonen, 2005).

In this paper we concentrate on analyzing how students apply the resubmission feature in an automatic assessment system. Compared with traditional teaching, most AA tools allow the learner to receive immediate feedback and resubmit the assignment several times within a single learning session. Thus, even if the solution is incorrect, the learner can continue working with the assignment straight away and resubmit an improved solution. This promotes reflective thinking and can be a very valuable aid for learning (Korhonen, Malmi, Nikander, & Tenhunen, 2003). Our goal has been to investigate how this feature is used and how it affects the exercise results. We are especially interested in identifying different kinds of behavior within the student population.

Our target course is a data structures and algorithms course at Helsinki University of Technology, intended for first- and second-year students, including computer science (CS) majors and minors, who have already taken their first introductory programming course. The TRAKLA2 learning environment used in the course allows students to solve algorithm simulation exercises in which they imitate the way the target algorithms change data structures. This is carried out by manipulating visualizations of data structures through a dedicated graphical user interface (GUI). Typically, the simulation is carried out by context-sensitive drag-and-drop operations on nodes, edges, and keys. This sequence of operations is recorded and compared with a model sequence generated by an actual implemented algorithm. Finally, feedback on the correctness of the sequence is given to the student. Each problem instance uses randomized data, and a model solution is available for inspection as an algorithm animation. The pedagogical goal of these exercises is to promote conceptual-level understanding of various algorithms (including sorting, searching, and graph algorithms). After this, we can proceed to implementation exercises. For more information about the tool and the target course, see Karavirta, Korhonen, and Malmi (2005), and Malmi et al. (2004).
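To make the assessment step concrete, the following Python sketch compares a recorded sequence of simulation operations with a model sequence step by step and awards points in proportion to the number of matching steps. It is a simplified illustration of sequence-based assessment in general, not TRAKLA2's actual grading algorithm; the operation encoding and the point scale are hypothetical.

def grade_simulation(student_steps, model_steps, max_points=2):
    """Award points in proportion to the number of steps matching the model sequence."""
    matching = sum(1 for s, m in zip(student_steps, model_steps) if s == m)
    return round(max_points * matching / len(model_steps), 1)

# Hypothetical operation encoding: each step names the operation, the structure, and the key.
model = [("insert", "B-tree", 17), ("insert", "B-tree", 42), ("split", "B-tree", 2)]
student = [("insert", "B-tree", 17), ("insert", "B-tree", 42), ("insert", "B-tree", 99)]

print(grade_simulation(student, model))  # 1.3 points out of 2 for two matching steps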
This research originates from our observations when the current resubmission policy was applied in a course utilizing the TRAKLA and TRAKLA2 systems. If the number of resubmissions is not restricted, some learners are tempted to solve the exercises by applying a trial-and-error strategy. They use a vast number of reiterations without pondering the source of their errors. Such a practice, also called bricolage (Ben-Ari, 2001), is very inefficient when solving, for example, programming exercises. In the worst case, it could promote bad programming practices in the future as well. An obvious way to prevent such behavior is to restrict the number of allowed submissions, which we applied in TRAKLA until 2003. Another strategy is to use randomized input for each exercise instance, a method applied in TRAKLA2. Thus, the learner has to start the exercise over with new initial data in each reiteration. If he or she iterates, for example, 15 times on the same exercise, he or she actually needs to solve 15 different instances of the same exercise without getting any direct benefit from the previous submissions.

At the beginning, this strategy seemed very promising for preventing bricolage in TRAKLA2. Our assumption was that if students use the resubmission feature heavily, they also use considerably more time solving many instances of an exercise, and thus presumably learn better. We observed, as anticipated, that significant differences exist in the number of resubmissions used between learner groups. We investigated this issue further and performed a regression analysis, which implied that there is only a relatively small, but significant, correlation (r = 0.5682, p < 0.001) between the total number of resubmissions and the total time used while solving the exercise tasks in 2004 (Karavirta et al., 2005). This made us concerned about the overall time used by the learner group that employed the resubmission feature heavily. In order to gain a deeper insight into the overall phenomenon, we used a clustering technique to identify these learner groups based on their behavior. In this paper, we report the differences among these groups based on the average number of resubmissions used, the points achieved from the exercises, the time on task, and the examination points. This work repeats the study reported at the Koli Calling conference in 2005 (Karavirta et al., 2005).

The rest of the paper is organized as follows. Section 2 briefly describes the clustering method applied in analyzing the data. Section 3 in turn reports the results, and finally, in Section 4, we present our conclusions.

2. Method

Based on the observation that students apply the resubmission option in different ways, we decided to perform a deeper analysis. It seemed that the student population contains several groups with different characteristic behaviors. We chose to apply cluster analysis techniques to identify these groups. The algorithm used is based on competitive learning, a well-known neural network method for clustering (see Hertz, Krogh, & Palmer, 1991, and Karavirta et al., 2005 for more details). In competitive learning, a winning unit is calculated for each input vector by finding the unit closest to the input vector with regard to some criterion. All of the output units compete to be the winner, which is then adapted; hence the name competitive learning. Some other method for grouping could have been applied as well.
We chose this one, however, because it was straightforward to implement and is easy to repeat with new data in consecutive years.

In this research, the clustering is based on the following two dimensions (input units): the overall number of resubmissions, and the total points gained from the TRAKLA2 exercises. The number of resubmissions was a natural choice, since we are studying how resubmissions affect learning. TRAKLA2 points were used because students using an equal number of resubmissions get very different grades. Thus, these two parameters are a good way to separate different student groups from each other. In addition, both parameters are independent of a student's performance on other parts of the course. Thus, the research question is: can we predict student performance on these other course tasks based on this information? We performed several tests to identify the number of clusters. We wanted the maximum number of groups that can still be separated from each other in terms of statistical tests, and finally selected the number of clusters to be five. Thus, the number of codevectors (output units) is five.
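As an illustration of the general technique, the following Python sketch applies a basic winner-take-all competitive learning rule to the two normalized features used here, the number of resubmissions and the TRAKLA2 points. It is a minimal sketch of the method described above, not the actual analysis code of the study; the learning rate, the number of epochs, the random initialization, and the example data are illustrative assumptions.

import numpy as np

def competitive_clustering(data, n_units=5, epochs=100, lr=0.1, seed=0):
    """Cluster rows of `data` with a simple winner-take-all competitive rule."""
    rng = np.random.default_rng(seed)
    # Normalize each feature to [0, 1] so that neither dimension dominates the distance.
    lo, hi = data.min(axis=0), data.max(axis=0)
    x = (data - lo) / (hi - lo)
    # Initialize the codevectors (output units) at randomly chosen data points.
    units = x[rng.choice(len(x), size=n_units, replace=False)].copy()
    for _ in range(epochs):
        for row in rng.permutation(x):
            winner = np.argmin(np.linalg.norm(units - row, axis=1))
            # Only the winning unit is adapted, i.e. moved towards the input vector.
            units[winner] += lr * (row - units[winner])
    # Assign every student to the nearest codevector.
    labels = np.array([np.argmin(np.linalg.norm(units - row, axis=1)) for row in x])
    return units, labels

# Hypothetical (resubmissions, TRAKLA2 points) pairs, one row per student.
students = np.array([
    [30, 33.0], [45, 38.0], [60, 52.0], [70, 55.0],
    [120, 65.0], [150, 66.0], [260, 62.0], [40, 69.0],
])
centers, groups = competitive_clustering(students)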
We have previously analyzed and reported on the data from spring 2004 (Karavirta et al., 2005). The analysis presented in the next section, however, is based on the results of the course in spring 2005. There are two significant changes. First, the resubmissions were not limited in any way in 2004. In 2005, however, we limited the resubmissions in the first exercise round to five submissions per exercise; for the rest of the rounds there were no limitations. Our hypothesis was that there would be a transfer effect: if the students are forced to think before submitting in the first round, we believed this would also hold on the rounds with no limitations. Second, only CS minor students could be selected for the 2005 setup, as we had another study that prevented us from collecting coherent data from the CS majors. In 2004, we had both groups, and a comparison revealed some differences between them. In the next section, we report the results from 2005 and the changes observed when comparing them with the 2004 data.

3. Results

The results of the clustering are shown in Figure 1. The x-axis denotes the number of submissions during the whole course, and the y-axis denotes the total number of points achieved (maximum 70 points). Each symbol corresponds to one student in the target population, and the clusters are indicated by different symbols. We identified the cluster characteristics, hereafter labeled as PASSERS, ORDINARIES, ITERATORS, AMBITIOUS, and TALENTED. The idea is to give intuitive labels to the groups to make the comparisons easier to follow. However, some of the labels include our hypothesis about the learning style, motivation, and skills of the majority of students in the particular group.1 These features are further discussed below.

Figure 1. CS minor learners from 2005, clustered into five groups. The number of learners is 213, the number of exercises is 29, and the maximum number of TRAKLA2 points is 70.

PASSERS can be found in the lower left corner of the figure. They clearly get the lowest exercise points and do not reiterate much to improve their grade. Our assumption is that they do not put much effort into the course. Instead, they aim at passing the course with minimum effort. This assumption is supported by the fact that they seem to aim at the number of points that is barely enough for passing the exercises. Of course, these students need not be PASSERS in other courses. In fact, other courses might be the reason why they do not have enough time for this one.

ORDINARIES are found above PASSERS on the left side of the figure. Thus, they achieve better results than PASSERS. However, they apparently do not aim at the best grade, since they do not exploit the resubmission feature to the same degree as, for example, the AMBITIOUS. They seem to be satisfied with a reasonably good grade achieved without too much effort. Thus, ORDINARIES refers to a kind of average student who does pretty much what is expected.

ITERATORS are found in the top-right corner. They use the resubmission feature heavily, much more than any other group. They finally get good exercise points, but not all of them reach full points despite the large number of submissions.
However, their most prominent characteristic is the large number of resubmissions used.

TALENTED are students found in the top-left corner. They use the fewest iterations and still achieve the best results. It is likely that they are brilliant students who do well in whatever they do, or that they have well-developed learning practices in which they first think the problem through and act only thereafter. This is supported by the fact that they seem to know what they are doing: they get the best grade but do not need the resubmissions much. Of course, it is possible that they have used a considerable amount of time to learn the topic before they solve the exercises. However, they seem to be able to solve the exercises quickly and effectively.

AMBITIOUS is the final group, located at the top of the diagram in the middle; thus they fall between the TALENTED and the ITERATORS. The label comes from the fact that they seem to aim at very good exercise points (and a good overall course grade), which they also finally achieve. They use considerably fewer resubmissions than ITERATORS, but more than ORDINARIES, to achieve their better results. Yet, they are not as good as TALENTED in their first submissions, as they need more resubmissions.

Many of the PASSERS get less than the minimum number of TRAKLA2 points (35 points) needed to pass the exercises and the course. This is possible because the students are allowed to make up the missing TRAKLA2 points after the deadline(s) by solving extra exercises. Thus, we assume that they still believe in passing the course, since they take the final examination. The points from the make-up rounds are not included in the analysis due to their different grading policy.

A deeper analysis of the student groups revealed more evidence that the clusters really differ from each other. We studied the groups with respect to one independent variable, the final examination points, and one dependent variable, time-on-task, i.e. the overall time spent on the exercise rounds. In addition, we provide the following data from the 2004 study to show one important difference between the CS major and minor students that might introduce a bias into our data.

3.1. CS Majors vs. CS Minors in 2004

This analysis cannot be done for the 2005 data, as the CS majors did partly different exercises, and thus the results are not comparable. However, we want to repeat these results from our earlier paper (Karavirta et al., 2005) in order to provide more background information for the conclusions. Formally, the two groups have separate parallel courses. However, in 2004, the lectures and the TRAKLA2 assignments were the same for both courses. The distribution of students between the two courses is shown in Table 1. The group TALENTED contains considerably more CS major students, and the AMBITIOUS group fewer, than their share of the total population would imply. It is probable that prior knowledge of the topic plays a role here.
3.2. Final Examination

The average points in the final examination after the course are shown in Table 2. Typically (as in 2004), the course has only one final examination right after the course (and several other opportunities later in the year). In 2005, however, we had two mid-term examinations, whose points are summed in the table. Thus, in the following, final examination refers to these two mid-term examinations. The examinations take place in a controlled lecture hall and contain several different types of assignments: definitions of concepts, explanations of how algorithms work, small algorithm analysis and design tasks, as well as questions about choosing algorithms for a given application. In addition, there are algorithm simulation tasks, but they form only a small part of the whole examination. Students are not allowed to have any extra material with them.
Table 1. The number of CS majors and CS minors in the 2004 courses. The percentages give the share of majors/minors within each student group.

Student group    N      CS Majors      CS Minors
PASSERS          24     10 (41.7%)     14 (58.3%)
ORDINARIES       72     25 (34.7%)     47 (65.3%)
ITERATORS        45     14 (31.1%)     31 (68.9%)
AMBITIOUS        98     29 (29.6%)     69 (70.4%)
TALENTED         117    57 (48.7%)     60 (51.3%)
ALL              356    135 (37.9%)    221 (62.1%)
Table 2. The final examination points for CS minors in 2005 in each group, with the corresponding 2004 figures. The maximum number of points is 84 in 2005 (34 in 2004). The relative size of each group is given in parentheses.

Student group    N 2005         Avg 2005    N 2004          Avg 2004
PASSERS          19 (8.9%)      23.9        24 (6.7%)       10.0
ORDINARIES       37 (17.4%)     37.2        72 (20.2%)      16.2
ITERATORS        5 (2.3%)       39.1        45 (12.6%)      15.9
AMBITIOUS        58 (27.2%)     44.4        98 (27.5%)      16.9
TALENTED         94 (44.1%)     50.9        117 (32.8%)     20.6
ALL              213            39.1        356             17.4
Figure 2. Boxplots for examination points of each group. The box has lines at the lower quartile, median, and upper quartile values. The whiskers show the extent of the data. Values beyond the whiskers are considered outliers. The notches represent the 95% data range.
The box plot in Figure 2 illustrates the differences among the groups by showing the medians as well as the 25% and 75% quartiles. The linear correlation between TRAKLA2 points and examination points over all groups is very significant (r = 0.55, p < 0.001).2 Statistical analysis of the examination result distributions shows that the PASSERS' average points are significantly lower (t-test, p < 0.01) than the average points of the groups ORDINARIES, TALENTED, and AMBITIOUS. Moreover, the average points of the TALENTED group are significantly higher (t-test, p < 0.01) than those of the three other groups. The number of ITERATORS, however, is too small: based on the 2005 data alone it would be too early to draw a conclusion, but taking the 2004 data into account, we can say that the differences are also significant for ITERATORS. On the other hand, in 2004 there were no statistical differences among the groups ORDINARIES, ITERATORS, and AMBITIOUS (t-test, 0.33 < p < 0.84). In 2005, however, there is a significant difference between ORDINARIES and AMBITIOUS (t-test, p = 0.01), which is probably related to the differences in the populations (in 2005 only CS minors were considered).
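Comparisons of this kind can be reproduced with standard statistical tools. The following Python sketch, using SciPy, shows how the correlation between exercise points and examination points and a pairwise t-test between two groups might be computed; the arrays are hypothetical placeholders, not the actual course data.

import numpy as np
from scipy import stats

# Hypothetical per-student scores; in the real analysis these come from the course logs.
trakla_points = np.array([34.0, 52.0, 61.5, 68.0, 70.0, 45.5, 58.0])
exam_points = np.array([20.0, 35.0, 41.0, 55.0, 60.0, 30.0, 44.0])

# Linear (Pearson) correlation between exercise points and examination points.
r, p_corr = stats.pearsonr(trakla_points, exam_points)

# Two-sample t-test comparing the examination points of two student groups.
passers_exam = np.array([18.0, 22.0, 25.0, 30.0])
talented_exam = np.array([48.0, 52.0, 55.0, 60.0])
t_stat, p_ttest = stats.ttest_ind(passers_exam, talented_exam)

print(f"r = {r:.2f}, p = {p_corr:.3f}; t = {t_stat:.2f}, p = {p_ttest:.3f}")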
3.3. Time-on-task

Figure 3 shows clear differences in time-on-task in 2005. ITERATORS use the most time and AMBITIOUS somewhat less, but still much more than ORDINARIES, TALENTED, and PASSERS. In the 2004 data (Karavirta et al., 2005), the differences among these groups are significant (t-test, p < 0.01). In addition, there are no significant differences among the groups ORDINARIES, TALENTED, and PASSERS. In 2005, however, some of the differences are only almost significant (0.001 < p < 0.014) due to the small number of ITERATORS. In addition, the results for AMBITIOUS and TALENTED are somewhat different in 2005, but not significantly so (t-test, p = 0.104) when we compare the results of these groups across the two years. This is probably due to the fact that the CS majors are missing from the groups in 2005. We know from our prior study that there are relatively more CS majors in TALENTED than in AMBITIOUS, and for CS minors the situation is the opposite. Without the CS majors, the two groups move closer to each other, so the effect is not as clear.

The correlation between the number of resubmissions and the total time used is evident. Considering all clusters, the correlation in 2005 is significant (r = 0.31, p < 0.001). Furthermore, there are significant differences among the groups if we look at the average time per resubmission. Table 3 depicts the total time used for the exercise rounds.
Figure 3. Boxplot of time used for each group in 2005. One outlier in group ITERATORS has been left out of the figure.

Table 3. Overall time spent (in minutes), the number of resubmissions, and the average time per submission (in minutes).

Student group    Time (avg.)    Resubm. (avg.)    Subm. time (avg.)
PASSERS          693            58                11.9
ORDINARIES       781            67                11.7
ITERATORS        1536           271               5.7
AMBITIOUS        1010           122               8.3
TALENTED         869            65                13.4
ALL              892            85                10.2
We can see that TALENTED use more than twice as much time per resubmission as ITERATORS and 60% more than AMBITIOUS. Compared with 2004, the results are similar if we take into account the somewhat different clustering of ITERATORS and TALENTED in 2005 (very few ITERATORS, many more TALENTED). This suggests that ITERATORS resubmit their work more carelessly than the others.

4. Conclusions

The clustering method used in this research provides an instrument to predict and identify several learner groups in the target course. The data analysis shows that there are two extreme groups that could be identified quite early in the course: PASSERS and TALENTED. In addition, their performance in the final examination is very different. The data supports the view that TALENTED have the skill and the motivation to do well, whereas PASSERS probably just wish to pass the compulsory course with minimum effort.

On the other hand, the three other groups, ITERATORS, AMBITIOUS, and ORDINARIES, do almost equally well in the examination, although their behavior in solving the exercises differs considerably. ITERATORS put a lot of time and effort into the exercises. It is evident that they use more resubmissions than the other groups, but this does not seem to improve their examination grade. AMBITIOUS have a similar motivation for getting a good exercise grade, but they seem to think more before they submit an exercise anew: the average time between two consecutive resubmissions is considerably higher for them than for ITERATORS. Finally, it is interesting that the ordinary students seem to optimize their effort the most. They get equal grades in the examination without aiming at the best grades in the exercises. The exercise grade has only a 30% weight in the final course grade, so they need not get the best exercise grade to obtain a satisfactory final grade.

Our experience during the transition from TRAKLA (limited number of submissions) to TRAKLA2 (almost unlimited resubmissions) was that the average number of resubmissions increased considerably. At the same time, the percentage of students gaining maximum points increased significantly. We did not worry about the increased number of resubmissions, since each exercise is reinitialized with new random data in every reiteration. We thought that students would learn better by rethinking the exercise before resubmitting. This analysis has revealed that our assumption does not hold for all learners: ITERATORS work inefficiently.3

The results suggest that we should guide the ITERATORS' learning process by limiting the number of resubmissions, at least to some extent. If the limit is high enough, the other groups are not affected, because they use fewer resubmissions in the first place. There are several ways to do this, for example by limiting the number of allowed resubmissions for each exercise, or the total number of resubmissions over all the exercises; a hypothetical sketch of such a check is given below.
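The following Python sketch illustrates, under purely hypothetical names and limits, how such a policy check might look in an assessment system: a submission is accepted only if neither a per-exercise limit nor a course-wide total limit has been reached. It does not describe TRAKLA2's actual interfaces or configuration.

from collections import Counter

# Hypothetical limits; in 2005 TRAKLA2 used a limit of five per exercise on the first round only.
MAX_PER_EXERCISE = 5
MAX_TOTAL = 100

def may_resubmit(history: Counter, exercise_id: str) -> bool:
    """Return True if another submission to `exercise_id` is allowed.

    `history` counts the submissions already made, keyed by exercise id.
    """
    per_exercise_ok = history[exercise_id] < MAX_PER_EXERCISE
    total_ok = sum(history.values()) < MAX_TOTAL
    return per_exercise_ok and total_ok

# Example: a student who has already submitted exercise "bst-insert" four times.
history = Counter({"bst-insert": 4, "heap-build": 2})
print(may_resubmit(history, "bst-insert"))  # True: one submission still allowed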
However, the results also support another strategy aiming at the very same goal. We call this the transfer effect: limiting the number of allowed submissions at the beginning of the course seems to lower the number of resubmissions during the rest of the course as well. The share of ITERATORS was significantly lower in 2005 (about 2%), when we limited the resubmissions in the first exercise round, than in 2004 (about 13%), when there were no limitations. Even though the clustering method plays a role here, we are quite satisfied with the low number of iterators in 2005 and are going to continue this practice.

In the future, there are many interesting options for continuing the research work and refining the clustering instrument. First, we need more research to justify our assumptions about the behavior of students in each cluster. We could interview a number of students and use questionnaires to find out whether our hypotheses about their motivations are correct. Second, we could gather more background information about the student groups, such as learning styles (Felder & Silverman, 1988), goal orientations, or resubmission behavior in other programming courses. These studies could reveal new information on student behavior. Finally, adjusting the resubmission policy is not the only way to guide the students towards a better learning style. Improving the feedback to better adapt to the learners' needs could also reduce inefficient reiteration.

Acknowledgment

This work was supported by the Academy of Finland under grant number 210947.

Notes

1. We do not claim that all students within a group always hold the given characteristics.
2. This correlation is higher than in 2004, which reflects the type of final examination more than the learner groups.
3. We believe that at least some ITERATORS use a trial-and-error strategy to solve the exercises. Another explanation for the large number of submissions could be, for example, that the learners are practicing for the examination. However, the data used here include only submissions made before the exercise deadline, and typically the examinations take place after the deadlines.
References

Ala-Mutka, K. (2005a). Automatic assessment tools in learning and teaching programming. Doctoral dissertation (Tampere University of Technology, Publication 559). Institute of Software Systems, Tampere University of Technology.
Ala-Mutka, K. (2005b). A survey of automated assessment approaches for programming assignments. Computer Science Education, 15, 83–102.
Ben-Ari, M. (2001). Constructivism in computer science education. Journal of Computers in Mathematics and Science Teaching, 20(1), 45–73.
Carter, J., English, J., Ala-Mutka, K., Dick, M., Fone, W., Fuller, U., & Sheard, J. (2003). ITICSE working group report: How shall we assess this? SIGCSE Bulletin, 35, 107–123.
English, J., & Siviter, P. (2000). Experience with an automatically assessed course. In Proceedings of the 5th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, ITiCSE'00 (pp. 168–171). Helsinki, Finland. New York: ACM Press.
Felder, R. M., & Silverman, L. K. (1988). Learning styles and teaching styles in engineering education. Engineering Education, 78, 674–681.
Forsythe, G. E., & Wirth, N. (1965). Automatic grading programs. Communications of the ACM, 8, 275–278. http://doi.acm.org/10.1145/364914.364937
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Boston: Addison-Wesley.
Karavirta, V., Korhonen, A., & Malmi, L. (2005). Different learners need different resubmission policies in automatic assessment systems. In Proceedings of the 5th Annual Finnish/Baltic Sea Conference on Computer Science Education (pp. 95–102). University of Joensuu.
Korhonen, A., & Malmi, L. (2000). Algorithm simulation with automatic assessment. In Proceedings of the 5th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, ITiCSE'00 (pp. 160–163). Helsinki, Finland. New York: ACM Press. http://www.cs.hut.fi/Research/SVG/publications/algo_simu_with_auto_assessment.pdf
Korhonen, A., Malmi, L., Myllyselkä, P., & Scheinin, P. (2002). Does it make a difference if students exercise on the web or in the classroom? In Proceedings of the 7th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, ITiCSE'02 (pp. 121–124). Aarhus, Denmark. New York: ACM Press. http://www.cs.hut.fi/Research/SVG/publications/web_or_classroom.pdf
Korhonen, A., Malmi, L., Nikander, J., & Tenhunen, P. (2003). Interaction and feedback in automatically assessed algorithm simulation exercises. Journal of Information Technology Education, 2, 241–255. http://www.cs.hut.fi/Research/SVG/publications/interaction&feedback_in_aaas_exercises.pdf
Laakso, M.-J., Salakoski, T., & Korhonen, A. (2005). The feasibility of automatic assessment and feedback. In Proceedings of Cognition and Exploratory Learning in Digital Age (CELDA 2005) (pp. 113–122). Porto, Portugal.
Luck, M., & Joy, M. (1999). A secure on-line submission system. Software: Practice and Experience, 29, 721–740.
Malmi, L., Karavirta, V., Korhonen, A., & Nikander, J. (forthcoming). Experiences on automatically assessed algorithm simulation exercises with different resubmission policies. Journal of Educational Resources in Computing. Accepted for publication.
Malmi, L., Karavirta, V., Korhonen, A., Nikander, J., Seppälä, O., & Silvasti, P. (2004). Visual algorithm simulation exercise system with automatic assessment: TRAKLA2. Informatics in Education, 3(2), 267–288.
Mitrovic, A., Martin, M., & Mayo, M. (2000). Using evaluation to shape ITS design: Results and experiences with SQL-Tutor. User Modeling and User-Adapted Interaction, 12, 243–279.
Naur, P. (1964). Automatic grading of students' ALGOL programming. BIT, 4, 177–188.
Rees, M. J. (1982). Automatic assessment aids for Pascal programs. SIGPLAN Notices, 17, 33–42.
Saikkonen, R., Malmi, L., & Korhonen, A. (2001). Fully automatic assessment of programming exercises. In Proceedings of the 6th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, ITiCSE'01 (pp. 133–136). Canterbury, UK. New York: ACM Press.
Tscherter, V., Lamprecht, R., & Nievergelt, J. (2002). Exorciser: Automatic generation and interactive grading of exercises in the theory of computation. In Fourth International Conference on New Educational Environments (pp. 47–50).